- CRX: direct CHARE inference (Algorithm 7, TODS 2010) - iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010) - RWR₀: SORE repair (Algorithm 6, TODS 2010) - rwr²: k-ORE extraction (Algorithm 3, arXiv 2010) - SOA, k-OA, iKoa, 2T-INF, Baum-Welch - Ansible role grammar adapter - Generic YAML key-path converter - 28 tests, all passing
Inference of Concise Regular Expressions
|
||
and DTDs
|
||
GEERT JAN BEX and FRANK NEVEN
|
||
Hasselt University and Transnational University of Limburg
|
||
THOMAS SCHWENTICK
|
||
Dortmund University
|
||
and
|
||
STIJN VANSUMMEREN
|
||
Université Libre de Bruxelles
|
||
|
||
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set
|
||
of XML-documents, a problem that basically reduces to learning concise regular expressions from
|
||
positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the
vast majority of expressions used in practical DTDs. For the inference of SOREs we present several
|
||
algorithms that first infer an automaton for a given set of example strings and then translate that
|
||
automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE
|
||
can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available,
|
||
however (for instance when the data is generated by Web service requests or by answers to queries),
|
||
these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel
|
||
learning algorithm CRX that directly infers CHAREs (which form a subclass of SOREs) without
|
||
going through an automaton representation. We show that CRX performs very well within its target
|
||
class on very small datasets.
|
||
|
||
This research was done while S. Vansummeren was a Postdoctoral Fellow of the Research
|
||
Foundation-Flanders (FWO) at Hasselt University.
|
||
This work was funded by FWO-G.0821.09N and the Future and Emerging Technologies (FET)
|
||
programme within the Seventh Framework Programme for Research of the European Commission,
|
||
under the FET-Open grant agreement FOX, number FP7-ICT-233599.
|
||
Authors’ addresses: G. J. Bex and F. Neven, Database and Theoretical Computer Science Research Group, Hasselt University and Transnational University of Limburg, Agoralaan, gebouw D,
|
||
B-3590 Diepenbeek Belgium; email: {geertjan.bex, frank.neven}@uhasselt.be; T. Schwentick, TU
|
||
Dortmund, Fakultät für Informatik, Otto-Hahn-Str. 16, Raum 214, 44227 Dortmund, Germany.
|
||
email: thomas.schwentick@udo.edu; S. Vansummeren, Research Laboratory for Web and Information Technologies (WIT), Université Libre de Bruxelles, 50 Av. F. Roosevelt, CP 165/15 B-1050
|
||
Brussels, Belgium; email: stijn.vansummeren@ulb.ac.be.
|
||
Permission to make digital or hard copies of part or all of this work for personal or classroom use
|
||
is granted without fee provided that copies are not made or distributed for profit or commercial
|
||
advantage and that copies show this notice on the first page or initial screen of a display along
|
||
with the full citation. Copyrights for components of this work owned by others than ACM must be
|
||
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
|
||
to redistribute to lists, or to use any component of this work in other works requires prior specific
|
||
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
|
||
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
|
||
© 2010 ACM 0362-5915/2010/04-ART11 $10.00
|
||
DOI 10.1145/1735886.1735890 http://doi.acm.org/10.1145/1735886.1735890
|
||
ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.
|
||
Categories and Subject Descriptors: F.4.3 [Mathematical Logic and Formal Languages]:
|
||
Formal Languages; H.2.1 [Database Management]: Logical Design; I.2.6 [Artificial Intelligence]: Learning; I.7.2 [Document and Text Processing]: Document Preparation
|
||
General Terms: Algorithms, Languages, Theory
|
||
Additional Key Words and Phrases: Regular expressions, schema inference, XML
|
||
ACM Reference Format:
|
||
Bex, G. J., Neven, F., Schwentick, T., and Vansummeren, S. 2010. Inference of concise regular
|
||
expressions and DTDs. ACM Trans. Datab. Syst. 35, 2, Article 11 (April 2010), 47 pages.
|
||
DOI = 10.1145/1735886.1735890 http://doi.acm.org/10.1145/1735886.1735890
|
||
|
||
1. INTRODUCTION
|
||
The eXtensible Markup Language (XML) serves as the lingua franca for data
|
||
exchange on the Internet [Abiteboul et al. 1999]. Because XML documents
|
||
in general can be of any form, most communities and applications impose
|
||
structural constraints on the documents that are to be exchanged or processed.
|
||
These constraints can be formally specified in a schema, which is written in a
|
||
schema language such as the Document Type Definitions (DTDs) or the XML
|
||
Schema Definitions (XSDs) [Thompson et al. 2004].
|
||
The advantages offered by the presence of a fully specified schema are
|
||
numerous. First and foremost, a schema allows automatic validation of the
|
||
input document structure, which not only facilitates automatic processing but
|
||
also ensures soundness of the input. Unvalidated input data from Web requests
|
||
is considered as the number one vulnerability for Web applications [Open Web
|
||
Application Security Project Consortium 2004]. The presence of a schema also
|
||
allows for automation and optimization of search, integration, and processing
|
||
of XML data (refer to, e.g., Benedikt et al. [2008], Deutsch et al. [1999], Koch
|
||
et al. [2004], Manolescu et al. [2001], Neven and Schwentick [2006], Wang
|
||
et al. [2003]). Moreover, various software development tools such as Castor
|
||
[Castor] and SUN’s JAXB [Sun] rely on schemas to perform object-relational
|
||
mappings for persistence. Furthermore, the existence of schemas is imperative
|
||
when integrating (meta) data through schema matching [Rahm and Bernstein
|
||
2001] and in the area of generic model management [Bernstein 2003; Melnik
|
||
2004]. A final advantage of a schema is that it assigns meaning to the data.
|
||
That is, it provides a user with a concrete semantics of the document and
|
||
aids in the specification of meaningful queries over XML data. Although the
|
||
examples mentioned here just scrape the surface of current applications,
|
||
they already underscore the importance of schemas accompanying XML
|
||
data.
|
||
Unfortunately, in spite of the aforementioned advantages, the presence of
|
||
a schema is not mandatory and many XML documents are not accompanied
|
||
by one. For instance, in a recent study Mignet et al. [2003] and Barbosa et al.
|
||
[2006] have shown that approximately half of the XML documents available
|
||
on the Web do not refer to a schema. In another study Bex et al. [2004] and
|
||
Martens et al. [2006] have noted that about two-thirds of XSDs gathered from
|
||
schema repositories and from the Web are not valid with respect to the W3C
|
||
XML Schema specification [Thompson et al. 2004], rendering them essentially
useless for immediate application. A similar observation was made by
|
||
Sahuguet [2000] concerning DTDs.
|
||
Based on the lack of schemas in practice, it is essential to devise algorithms
|
||
that can infer a schema for a given collection of XML documents when none, or
|
||
no syntactically correct one, is present. This is also acknowledged by Florescu
|
||
[2005] who emphasizes that in the context of data integration:
|
||
“We need to extract good-quality schemas automatically from existing data and perform incremental maintenance of the generated
|
||
schemas.”
|
||
In this article, we describe two novel schema inference algorithms outperforming existing systems in accuracy, conciseness, and speed.
|
||
It should be noted that even when a schema is already available, there
|
||
are situations where inference can be useful. One such situation is schema
|
||
cleaning: sometimes a schema is too general with respect to the XML data
|
||
that it is supposed to describe. In that case, it can be advantageous to infer a new schema based solely on the data at hand. This situation is nicely
|
||
illustrated by the following real-world example taken from the Protein Sequence Database DTD [Miklau 2002], which gives the following definition for
|
||
the refinfo-element.
|
||
authors, citation, volume?, month?, year,
|
||
pages?, (title | description)?, xrefs?
|
||
An analysis of the available XML corpus (683MB of data) with our inference
|
||
algorithms yields the following more precise expression for the refinfo-element.
|
||
authors, citation, (volume | month), year,
|
||
pages?, (title | description)?, xrefs?
|
||
Note that the latter is more strict than the former, as it emphasizes that volume
|
||
and month do not occur together: either one specifies a month of publication for
|
||
a given journal article, or the volume that it has appeared in, but not both.
|
||
As this example illustrates, schema inference algorithms can hence be used to
|
||
better understand the semantics of a given XML dataset, making it possible to
|
||
adapt an existing schema when necessary. In general, schema inference can be
|
||
used to restrict schemas to a relevant subset of data needed by the application
|
||
at hand, thereby facilitating difficult tasks like schema matching and data
|
||
integration. Indeed, as argued by Hinkelman [2005], industry-level standards
|
||
are too loosely defined in general, which can result in XML schemas where
|
||
many business structures are formally specified as being optional.
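The difference between the two refinfo definitions can be checked mechanically. The following is a minimal sketch, assuming each content model is hand-translated into a Python regular expression over space-terminated element names; the names `ORIGINAL`, `INFERRED`, and `accepts` are illustrative and not part of the article.

```python
import re

# Content models over space-terminated element names. The regex encoding is an
# assumption of this sketch, hand-translated from the two refinfo definitions.
ORIGINAL = re.compile(
    r"authors citation (volume )?(month )?year (pages )?((title|description) )?(xrefs )?"
)
INFERRED = re.compile(
    r"authors citation (volume |month )year (pages )?((title|description) )?(xrefs )?"
)

def accepts(model, children):
    """Return True if the sequence of child element names matches the content model."""
    word = "".join(name + " " for name in children)
    return model.fullmatch(word) is not None

both = ["authors", "citation", "volume", "month", "year"]
only_month = ["authors", "citation", "month", "year", "pages"]
print(accepts(ORIGINAL, both), accepts(INFERRED, both))              # True False
print(accepts(ORIGINAL, only_month), accepts(INFERRED, only_month))  # True True
```

A word containing both volume and month is accepted only by the original, more general definition, which is exactly the sense in which the inferred expression is stricter.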
|
||
The second situation where schema inference is useful even though a schema
|
||
already exists is in the presence of noisy XML data. In such a situation, part or
|
||
all of the data that needs to be processed is rejected by the existing schema. For
|
||
instance, we have harvested and investigated a corpus of XHTML documents
|
||
from the Web and found that an astonishing 89% of 2092 documents were not
|
||
valid with respect to the XHTML Transitional specification [W3C 2002]. In this
|
||
case, the inference of a new schema based on the corpus and its comparison
with the XHTML Transitional specification provides a uniform view of the kind
of errors made. Further, given that one often has no choice but to deal with such
noisy data, one may infer a new schema from a subset of the corpus (deleting
documents that make unacceptable errors) and work with that schema rather
than with the official specification to retain at least a minimal validation.

Fig. 1. An example DTD.
|
||
1.1 Problem Setting
|
||
Based on the previous observations, it is hence essential to devise algorithms
|
||
that can automatically infer a DTD or XSD from a given corpus of XML
|
||
documents.
|
||
As illustrated in Figure 1, a DTD is essentially a mapping d from element
|
||
names to regular expressions over element names. An XML document is valid
|
||
with respect to d if for every occurrence of an element name e in the document,
|
||
the word formed by its children belongs to the language of the corresponding
|
||
regular expression d(e). For instance, the DTD in Figure 1 requires each store
|
||
element to have zero or more order children, which must be followed by a
|
||
stock element. Likewise, each order must have a customer child, which must
|
||
be followed by one or more item elements.
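The notion of validity described above can be sketched in a few lines. This is an illustration only: the dictionary `DTD` encodes just the two rules spelled out in the text (not the full figure), content models are written as Python regular expressions over space-terminated child names, and `is_valid` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoding of the two rules described in the text: each content model
# is a Python regex over space-terminated child element names.
DTD = {
    "store": r"(order )*stock ",
    "order": r"customer (item )+",
}

def is_valid(element, dtd):
    """Check every element that has a rule in `dtd` against its content model."""
    for node in element.iter():
        rule = dtd.get(node.tag)
        if rule is None:
            continue  # simplification: elements without a rule are left unconstrained
        word = "".join(child.tag + " " for child in node)
        if re.fullmatch(rule, word) is None:
            return False
    return True

good = ET.fromstring("<store><order><customer/><item/><item/></order><stock/></store>")
bad = ET.fromstring("<store><order><customer/></order><stock/></store>")
print(is_valid(good, DTD), is_valid(bad, DTD))  # True False
```

The second document is rejected because its order element has a customer child but no item child.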
|
||
To infer a DTD from a corpus of XML documents C it hence suffices to look,
|
||
for each element name e that occurs in a document in C, at the set of element
|
||
name words that occur below e in C, and to infer from this set the corresponding
|
||
regular expression d(e). As such, the inference of DTDs reduces to the inference of regular expressions from sets of positive example words. To illustrate,
|
||
from the words id price, id qty supplier, and id qty item item appearing under <item> elements in a sample XML corpus, we could derive the following
|
||
rule.
|
||
item → (id, price | (qty, (supplier | item+ )))
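The reduction above starts by collecting, for each element name, the words formed by the children of its occurrences. A minimal sketch of that collection step follows; the function name `child_words` and the toy corpus are assumptions made for illustration, and the subsequent regular expression inference is not shown.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def child_words(documents):
    """Map each element name to the set of child-name sequences seen below it."""
    samples = defaultdict(set)
    for text in documents:
        for node in ET.fromstring(text).iter():
            samples[node.tag].add(tuple(child.tag for child in node))
    return samples

# Toy corpus (hypothetical) reproducing the three words from the text.
corpus = [
    "<items><item><id/><price/></item><item><id/><qty/><supplier/></item></items>",
    "<items><item><id/><qty/>"
    "<item><id/><price/></item><item><id/><price/></item>"
    "</item></items>",
]

for word in sorted(child_words(corpus)["item"]):
    print(" ".join(word))
```

Each set in the resulting mapping is a sample of positive example words from which the expression d(e) for that element name would then be inferred.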
|
||
While the inference of XSDs is more complicated than the inference of DTDs,
|
||
recent characterizations [Martens et al. 2006] show that the structural core of
|
||
XML schema (that is, the sets of trees that are definable by XSDs) correspond
|
||
to DTDs extended with vertical regular expressions. Therefore, one cannot
|
||
hope to successfully infer XSDs without good algorithms for inferring regular
|
||
expressions. As such, we focus in this article on the inference of regular expressions (and therefore, by the preceding reduction, on the inference of DTDs).
|
||
The inference of XSDs, building on the algorithms presented here, is treated in
|
||
a companion article [Bex et al. 2007].
|
||
In particular, let Σ be a fixed set of alphabet symbols (also called element
names), and let Σ* be the set of all words over Σ.
Definition 1 (Regular Expressions). In this article, we are interested in
learning regular expressions r, s of the form

$$r, s ::= \emptyset \mid \varepsilon \mid a \mid r \cdot s \mid r + s \mid r? \mid r^{+},$$

where parentheses may be added to avoid ambiguity. Here, $\varepsilon$ denotes the empty
word; a ranges over symbols in $\Sigma$; $r \cdot s$ denotes concatenation; $r + s$ denotes
disjunction; $r^{+}$ denotes one-or-more repetitions; and $r?$ denotes the optional
regular expression. That is, the language L(r) accepted by regular expression
r is given by

$$
\begin{aligned}
L(\emptyset) &= \emptyset \\
L(\varepsilon) &= \{\varepsilon\} \\
L(a) &= \{a\} \\
L(r \cdot s) &= \{vw \mid v \in L(r),\ w \in L(s)\} \\
L(r + s) &= L(r) \cup L(s) \\
L(r^{+}) &= \{v_1 \cdots v_n \mid n \geq 1 \text{ and } v_1, \ldots, v_n \in L(r)\} \\
L(r?) &= L(r) \cup \{\varepsilon\}.
\end{aligned}
$$
|
||
For convenience, we sometimes omit the concatenation symbol, simply writing rs for $r \cdot s$. Note that the Kleene star operator (denoting zero or more repetitions, as in $r^{*}$) is not allowed by the preceding syntax. This is not a restriction,
since $r^{*}$ can always be represented as $(r^{+})?$ or $(r?)^{+}$. Conversely, the latter can
always be rewritten into the former for presentation to the user. Also note that
the previous syntax uses $r + s$ to denote disjunction rather than the vertical
bar notation r | s used by DTDs. The former notation should not be confused
with the one-or-more repetition operator $r^{+}$, where the plus symbol is used in
the exponent.
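The semantics of Definition 1, and the remark that the Kleene star is expressible with the allowed operators, can be illustrated by enumerating languages up to a length bound. The sketch below assumes single-character alphabet symbols and a nested-tuple encoding of expressions; the function `lang` and the tuple tags are names chosen here for illustration.

```python
def lang(e, n):
    """Words of length <= n in L(e), following the clauses of Definition 1."""
    kind = e[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {e[1]} if n >= 1 else set()
    if kind == "cat":
        return {v + w for v in lang(e[1], n) for w in lang(e[2], n) if len(v + w) <= n}
    if kind == "alt":
        return lang(e[1], n) | lang(e[2], n)
    if kind == "opt":
        return lang(e[1], n) | {""}
    if kind == "plus":
        base = lang(e[1], n)
        result, frontier = set(base), set(base)
        while frontier:  # close under concatenation, staying within the length bound
            frontier = {v + w for v in frontier for w in base if len(v + w) <= n} - result
            result |= frontier
        return result
    raise ValueError(kind)

ab = ("cat", ("sym", "a"), ("sym", "b"))
star_as_opt_plus = ("opt", ("plus", ab))   # (r+)?
star_as_plus_opt = ("plus", ("opt", ab))   # (r?)+
print(sorted(lang(star_as_opt_plus, 6)))   # ['', 'ab', 'abab', 'ababab']
print(lang(star_as_opt_plus, 6) == lang(star_as_plus_opt, 6))  # True
```

Agreement up to a length bound is of course only a sanity check, not a proof of equivalence.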
|
||
The class of all regular expressions is actually too large for our purposes,
|
||
as both DTDs and XSDs require the regular expressions occurring in them to
|
||
be deterministic (also sometimes called one-unambiguous [Brüggemann-Klein
|
||
and Wood 1998]). Intuitively, a regular expression is deterministic if, without
looking ahead in the input word, each symbol of that word can be matched
uniquely against a position in the expression when processing the input in
one pass from left to right. For instance, $(a + b)^{*} a$ is not deterministic, as already the first symbol in the word aaa could be matched by either the first or
the second a in the expression. Without lookahead, it is impossible to know
which one to choose. The equivalent expression $b^{*} a (b^{*} a)^{*}$, on the other hand, is
deterministic.
|
||
Definition 2. Let $\bar{r}$ stand for the regular expression obtained from r by
replacing the ith occurrence of alphabet symbol a in r by $a^{(i)}$, for every i and
a. For example, for $r = b^{+} a (b a^{+})?$ we have $\bar{r} = {b^{(1)}}^{+} a^{(1)} (b^{(2)} {a^{(2)}}^{+})?$. A regular
expression r is deterministic if there are no words $w a^{(i)} v$ and $w a^{(j)} v'$ in $L(\bar{r})$
such that $i \neq j$.
|
||
Equivalently, an expression is deterministic if the so-called Glushkov construction [Brüggemann-Klein 1993] translates it into a deterministic finite
automaton rather than a nondeterministic one [Brüggemann-Klein and Wood
|
||
1998]. Not every nondeterministic regular expression is equivalent to a deterministic one [Brüggemann-Klein and Wood 1998]. Thus, semantically, the class
|
||
of deterministic regular expressions forms a strict subclass of the class of all
|
||
regular expressions.
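The Glushkov-based characterization of determinism can be turned into a small test. The following is a sketch under stated assumptions, not the construction as given in the cited works: expressions are nested tuples, positions are numbered left to right, and an expression is reported deterministic when neither the set of first positions nor any follow set contains two distinct positions carrying the same symbol. All function names are illustrative.

```python
def sym(a):
    return ("sym", a)

def cat(r, s):
    return ("cat", r, s)

def alt(r, s):
    return ("alt", r, s)

def opt(r):
    return ("opt", r)

def plus(r):
    return ("plus", r)

def star(r):
    return opt(plus(r))  # r* written as (r+)?

def is_deterministic(expr):
    symbols = []  # symbols[p] is the alphabet symbol at position p
    follow = {}   # follow[p] is the set of positions that may directly follow p

    def walk(e):
        """Return (nullable, first, last) for e while filling symbols and follow."""
        kind = e[0]
        if kind == "sym":
            p = len(symbols)
            symbols.append(e[1])
            follow[p] = set()
            return False, {p}, {p}
        if kind in ("cat", "alt"):
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            if kind == "alt":
                return n1 or n2, f1 | f2, l1 | l2
            for p in l1:
                follow[p] |= f2
            return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())
        n, f, l = walk(e[1])
        if kind == "opt":
            return True, f, l
        if kind == "plus":
            for p in l:
                follow[p] |= f
            return n, f, l
        raise ValueError(kind)

    def clash(positions):
        names = [symbols[p] for p in positions]
        return len(names) != len(set(names))

    _, first, _ = walk(expr)
    return not clash(first) and not any(clash(s) for s in follow.values())

a, b = sym("a"), sym("b")
print(is_deterministic(cat(star(alt(a, b)), a)))                 # False: (a + b)* a
print(is_deterministic(cat(star(b), cat(a, star(cat(star(b), a))))))  # True: b* a (b* a)*
```

The two calls reproduce the examples from the text: the first expression has two competing a positions at the start, the second does not.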
|
||
Learning in the limit. For the purpose of inferring DTDs from XML data,
|
||
we are hence in search of an algorithm that, given enough sample words of a
|
||
target deterministic regular expression r, returns a deterministic expression r′
|
||
equivalent to r. In the framework of learning in the limit [Gold 1967], such an
|
||
algorithm is said to learn the deterministic regular expressions from positive
|
||
data.
|
||
Definition 3. Define a sample to be a finite subset of $\Sigma^{*}$ and let R be
a subclass of the regular expressions. An algorithm M mapping samples to
expressions in R is said to learn R from positive data if: (1) S ⊆ L(M(S)) for
every sample S and (2) to every r ∈ R we can associate a so-called characteristic
sample $S_r$ ⊆ L(r) such that, for each sample S with $S_r$ ⊆ S ⊆ L(r), M(S) is
equivalent to r.
|
||
Intuitively, the first condition says that M must be sound; the second that
|
||
M must be complete, given enough data. A class of regular expressions R is
|
||
learnable in the limit from positive data if an algorithm exists that learns R.
|
||
For the class of all regular expressions, it was shown by Gold [1967] that no
|
||
such algorithm exists. The same holds for the class of deterministic regular
|
||
expressions, as shown in our companion article [Bex et al. 2008].
|
||
PROPOSITION 4 (BEX ET AL. 2008). The class of deterministic regular expressions is not learnable in the limit from positive data.
|
||
Proposition 4 immediately excludes the possibility for an algorithm to infer
|
||
the full class of DTDs. In practice, however, regular expressions occurring in
|
||
DTDs and XSDs are concise rather than arbitrarily complex. Indeed, a study
|
||
of 819 DTDs and XSDs gathered from the Cover Pages [Cover 2003] (including
|
||
many high-quality XML standards) as well as from the Web at large, revealed
|
||
that regular expressions occurring in practical schemas are such that every
|
||
alphabet symbol occurs at most k times, with k small. Actually, in 98% of the
|
||
cases k = 1.
|
||
Definition 5. A regular expression is k-occurrence if every alphabet symbol
|
||
occurs at most k times in it.
|
||
For example, the expressions $\mathit{customer} \cdot \mathit{order}^{+}$ and $(\mathit{school} + \mathit{institute})^{+}$
are both 1-occurrence, while $\mathit{id} \cdot (\mathit{qty} + \mathit{id})$ is 2-occurrence (as id occurs twice).
|
||
Observe that if r is k-occurrence, then it is also l-occurrence for every l ≥ k.
|
||
To simplify notation, we often abbreviate “k-occurrence regular expression” by
|
||
k-ORE and also refer to the 1-OREs as “single occurrence regular expressions”
|
||
or SOREs.
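Definition 5 amounts to counting symbol occurrences. A minimal sketch, assuming expressions are given as plain strings in which element names are identifier-like tokens and everything else is operator syntax; `occurrence_bound` is a hypothetical helper name.

```python
import re
from collections import Counter

def occurrence_bound(expression):
    """Smallest k such that the expression is k-occurrence (0 if it has no symbols)."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expression)
    return max(Counter(names).values(), default=0)

print(occurrence_bound("customer . order+"))      # 1
print(occurrence_bound("(school + institute)+"))  # 1
print(occurrence_bound("id . (qty + id)"))        # 2
```

An expression is a SORE exactly when this bound is at most 1.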
|
||
Note that, since every alphabet symbol can occur at most once in a SORE,
|
||
every SORE is necessarily deterministic. Indeed, we have the following strict
inclusion hierarchy among the various classes of regular expressions just
|
||
discussed.
|
||
$$\text{SOREs} \subset \text{2-OREs} \subset \text{3-OREs} \subset \cdots \subset k\text{-OREs} \subset \text{all regex}$$

$$\text{SOREs} \subset \text{deterministic regex} \subset \text{all regex}$$
|
||
(For k ≥ 2, the classes of k-OREs and deterministic regular expressions are
|
||
incomparable.) Given their importance in practical schemas, we focus in this
|
||
article on the inference of SOREs. The inference of deterministic k-OREs for
|
||
k > 1 is treated in a companion article [Bex et al. 2008].
|
||
1.2 Outline and Contributions
|
||
In particular, we show in Section 3 that the class of SOREs can be efficiently
|
||
learned in the limit from positive data by first constructing an automaton
|
||
representation of the target SORE using techniques of García and Vidal [1990],
|
||
and by subsequently transforming this automaton into an equivalent SORE (if
|
||
such a SORE exists) using a novel polynomial-time algorithm called REWRITE.
|
||
For the general class of regular expressions the resulting expression can be of
|
||
exponential size, as we explain in more detail in Section 3. In Section 4, we
|
||
improve REWRITE to deal with real-world, and therefore incomplete, samples. In
|
||
contrast to REWRITE, which fails when its input automaton is not equivalent to
|
||
a SORE, the resulting improvement, called RWR, repairs the input automaton
|
||
until it becomes equivalent to a SORE. We also develop an extension of RWR,
|
||
called RWR², which improves the precision of RWR at the cost of increased running
|
||
time.
|
||
For the settings where extremely little XML data is available to infer a
|
||
schema from (for instance, when the data is returned as answers to queries or
|
||
Web service requests [Ngu et al. 2005; Oaks and ter Hofstede 2007]), we
|
||
introduce in Section 6 the algorithm CRX. CRX successfully learns the class
|
||
of CHAREs, a strict subclass of the SOREs that nevertheless holds great
|
||
practical importance. Indeed, the same investigation as before reveals that
|
||
more than 90% of the regular expressions occurring in practical schemas are
|
||
CHAREs [Martens et al. 2006].
|
||
We experimentally validate RWR, RWR², and CRX in Section 7 on both small and
|
||
large samples drawn from real-world target DTDs whose regular expressions
|
||
fall both within the class of SOREs/CHAREs and outside of those classes. In
|
||
all settings, our algorithms outperform existing systems in accuracy, conciseness, and speed. Further, we assess the strong generalization ability of CRX by
|
||
establishing on average the minimal number of sample words needed to derive
|
||
optimal regular expressions. In Section 8 we discuss how to extend RWR and
|
||
CRX to incrementally compute the inferred regular expressions when new data
|
||
arrive, how to address noise, and how to deal with numerical predicates. We
|
||
begin in the next section with a discussion of related work, and conclude in
|
||
Section 9.
|
||
It is important to note that this article differs from its conference version [Bex
|
||
et al. 2006] in the following way.
—First and foremost, it corrects the results of Bex et al. [2006] by providing
|
||
a completely new algorithm for converting automata into equivalent SOREs
|
||
(provided such a SORE exists), and gives a full correctness proof (Section 3).
|
||
In contrast to what is claimed in Bex et al. [2006], the conversion algorithm
|
||
of Bex et al. [2006] does not always yield an equivalent SORE, as discussed
|
||
in Section 5.
|
||
—It introduces new heuristics (based on a language size criterion) for dealing
|
||
with real-world, and therefore incomplete datasets (Section 4).
|
||
—It adds new experiments that measure: (1) the impact of noise and (2) the
|
||
accuracy of our algorithms under various levels of missing data.
|
||
2. RELATED WORK
|
||
Schema inference. Schemas for semistructured data have been defined in
|
||
Buneman et al. [1997], Fernandez and Suciu [1998], and McHugh et al.
|
||
[1997] and their inference has been addressed in Goldman and Widom [1997],
|
||
and Nestorov et al. [1997, 1998]. The methods in Nestorov et al. [1997] and
|
||
Goldman and Widom [1997] focus on the derivation of a graph summary
|
||
structure (called full representative object or dataguide) for a semistructured
|
||
database. This data structure contains all paths in the database. Approximations of this structure are considered by restricting to paths of a certain length.
|
||
The latter then basically reduces to the derivation of an automaton from a set
|
||
of bounded length strings. Naively restricting the algorithms to trees rather
|
||
than graphs is inappropriate since no order is considered between the children
|
||
of a node so that DTD-like schemas cannot be derived. However, even the use
|
||
of more sophisticated encodings of the XML documents using edges between
|
||
siblings would be to no avail since no algorithms are given to translate the
|
||
obtained automata to regular expressions. In Nestorov et al. [1998], a schema
|
||
is a typing by means of a datalog program. Again, no algorithms are given
|
||
to transform datalog types into regular expressions. These approaches
|
||
can therefore not be used to derive DTDs, not even when the semistructured
|
||
database is tree-shaped.
|
||
DTD inference. In the context of DTD inference, Sankey and Wong [2001]
|
||
propose several approaches to generate probabilistic string automata to represent regular expressions. To transform these into actual regular expressions,
|
||
and hence to obtain DTDs, the authors refer to the methods of Ahonen [1996].
|
||
The latter provides a method to translate one-unambiguous nonprobabilistic
|
||
string automata to regular expressions, as given by Brüggemann-Klein and
|
||
Wood [1998], followed by a post-processing simplification step. Apart from several case analyses based on a dictionary example, no systematic study of the
|
||
effectiveness of the approach is provided. In particular, in contrast to our results, no target class is given for which the set of transformations is complete.
|
||
There are only a few papers describing systems for direct DTD inference
|
||
[Garofalakis et al. 2003; Min et al. 2003; Chidlovskii 2001]. Only one of them is
|
||
available for testing: XTRACT [Garofalakis et al. 2003]. In Section 7, we make a
|
||
detailed comparison with our proposal. In contrast to our approach, the XTRACT
system generates for every separate string a regular expression while representing repeated subparts by introducing Kleene-*. In a second step, the system
|
||
factorizes common subexpressions of these candidate regular expressions using algorithms from the logic optimization literature. Finally, in the third step,
|
||
XTRACT applies the Minimum Description Length (MDL) principle to find the
|
||
best RE among the candidates. Although the approach has been shown to work
|
||
on real-world DTDs in Garofalakis et al. [2003], the XML data complying with
|
||
these DTDs was generated. We report in Section 7 that XTRACT has two kinds of
|
||
shortcomings on real-world XML data: (1) it generates large, long-winded, and
|
||
difficult to interpret regular expressions; and (2) it cannot handle large datasets (over 1000 strings). The latter is due to the NP-hard submodule in the
|
||
third step of the XTRACT algorithm [Fernau 2004]. The former problem seems
|
||
to be more fundamental. The final step results in expressions consisting of
|
||
disjunctions of regular expressions while in practice the large majority of regular expressions are concatenations of disjunctions [Martens et al. 2006]. As a
|
||
result, larger datasets result in larger regular expressions.
|
||
In Min et al. [2003] an adaptation of the XTRACT approach to a restricted
|
||
class of regular expressions which form a subclass of SOREs is described. Although the system, according to the experiments conducted in Min et al. [2003],
|
||
outperforms XTRACT in accuracy and efficiency, it seems that the two fundamental shortcomings described earlier remain. It would thus be surprising if the
|
||
system performed much better than XTRACT on real-world data. Similarly to
|
||
Ahonen [1996], the approach of Chidlovskii [2001] relies on the translation of
|
||
Glushkov automata to regular expressions which, in general, can lead to an
|
||
exponential size increase.
|
||
Trang [Clark] is state-of-the-art software written by James Clark, intended
as a schema translator for the schema languages DTD, Relax NG, and XML
Schema. In addition, Trang can infer a schema for a given set of XML
documents. We discuss Trang further in Section 7.1.
|
||
Language inference. Learning of regular languages from positive examples in
|
||
the computational learning community is mostly directed towards inference of
|
||
automata as opposed to inference of regular expressions [Angluin and Smith
|
||
1983; Pitt 1989; Sakakibara 1997]. As noted by Fernau [2004] and argued
|
||
in the previous section, first using learning algorithms for deterministic automata and then transforming these into regular expressions in general leads
|
||
to unmanageable and long-winded regular expressions. Some approaches to
|
||
inference of regular expressions for restricted cases have been considered. For
|
||
instance, Brāzma [1993] showed that regular expressions without union can
|
||
be approximately learned in polynomial time from a set of examples satisfying
|
||
some criteria. Fernau [2009] provided a learning algorithm for finite unions
|
||
of pairwise left-aligned union-free regular expressions. These expressions are
|
||
different from the expressions we consider here: they are not included in the
|
||
class of SOREs and do not contain all CHAREs. The development is purely
|
||
theoretical, no experimental validation has been performed.
|
||
Fig. 2. (a) The SOA accepting the same language as the SORE $a \cdot b \cdot (c + d^{+})$. (b) The SOA generated by 2T-INF for the sample S = {bacacdacde, cbacdbacde, abccaadcde}.

Automata to RE translation. Although heuristics for automata to RE translations [Delgado and Morais 2004; Han and Wood 2007] have been proposed,
all of them are optimizations of the classical state elimination algorithm. In
|
||
particular, they investigate the best order to eliminate states when going from
|
||
automata to regular expressions. So, they focus on the class of all automata
|
||
for which, as explained in Section 3, an exponential increase in size cannot be
|
||
avoided in general. Further, the methods remain theoretical as no experimental
|
||
analysis has been performed. Caron and Ziadi [2000] devise an algorithm deciding whether an automaton is Glushkov. If so, the automaton can be rewritten
|
||
into a short equivalent regular expression. Their method works in a top-down
|
||
fashion, that is, it derives the top nodes of the parse tree corresponding to
|
||
the regular expression first, and subsequently proceeds downward in the tree.
|
||
Consequently, the method first derives the largest subexpressions of the expression, making it harder to devise heuristics in the presence of missing data.
|
||
In contrast, our approach is bottom-up, that is, starting from the leaf nodes of
|
||
the parse tree, composing them into the smallest subexpressions.
|
||
3. A COMPLETE ALGORITHM FOR INFERRING SORES
|
||
Our goal in this section is to infer a SORE s equivalent to a target SORE r
|
||
given only a finite sample S ⊆ L(r). To this end, we first learn from S a Single
|
||
Occurrence Automaton (SOA for short). A SOA is a specific kind of deterministic
|
||
finite state automaton in which all states, except for the initial and final state,
|
||
are element names. Figure 2(a) gives an example. Note that in contrast to the
|
||
classical definition of automata, no edges are labeled: all incoming edges in a
|
||
state a are assumed to be labeled by a. As such, a word $a_1 \cdots a_n$ is accepted if
there is an edge from the initial state to $a_1$, an edge from $a_1$ to $a_2$, ..., and an
edge from $a_n$ to the final state. Thus, the SOA in Figure 2(a) accepts the same
language as $a \cdot b \cdot (c + d^+)$.
|
||
Definition 6 (SOA). Let src and sink be two special symbols, distinct from
|
||
the element names, that will serve as the initial and final state, respectively. A
|
||
single occurrence automaton is a finite directed graph G = (V, E) such that:
|
||
(1) {src, sink} ⊆ V and all nodes in V − {src, sink} are element names; and
|
||
(2) src has only outgoing edges; sink has only incoming edges; and every v ∈
|
||
V − {src, sink} is visited during a walk from src to sink.
|
||
Note that V − {src, sink} can be empty. We write L(G) for the set of all words
|
||
accepted by G; V(G) for the set of G’s vertices, and E(G) for G’s edge relation.
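The acceptance condition of a SOA can be illustrated with a minimal Python sketch. It assumes that a SOA is stored as a set of edges (pairs of state names) and that a word is a sequence of element names; the names `accepts` and `FIG_2A` are illustrative only and do not come from the paper.

```python
SRC, SINK = "src", "sink"


def accepts(edges, word):
    """Return True if the SOA with edge set `edges` accepts `word`.

    States coincide with element names, so a word a1...an is accepted
    exactly when src -> a1 -> ... -> an -> sink is a walk in the graph.
    """
    path = [SRC, *word, SINK]
    return all((u, v) in edges for u, v in zip(path, path[1:]))


# Edge set of the SOA learned from the sample {abc, abdd} (Figure 2(a)).
FIG_2A = {
    (SRC, "a"), ("a", "b"), ("b", "c"), ("b", "d"),
    ("d", "d"), ("c", SINK), ("d", SINK),
}
```

With this encoding, `accepts(FIG_2A, "abddd")` holds while `accepts(FIG_2A, "ab")` does not, in line with the expression a . b . (c + d+).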
|
||
|
||
|
||
Algorithm 1. 2T-INF
Input: a finite set of sample strings S
Output: a SOA G such that S ⊆ L(G)
1: Let V be the set of states consisting of all element names occurring in S plus the initial state src and final state sink
2: Initialize E := ∅
3: for each string a1 . . . an in S do
4:     add the edges (src, a1), (a1, a2), . . . , (an, sink) to E
5: end for
6: return G = (V, E)
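Algorithm 1 translates almost line by line into code. The following Python sketch assumes that each sample string is a sequence of element names (here single characters) and represents the SOA as a pair of sets; the function name `two_t_inf` is an illustrative choice.

```python
SRC, SINK = "src", "sink"


def two_t_inf(sample):
    """Build the SOA (V, E) of Algorithm 1 from an iterable of words."""
    vertices = {SRC, SINK}
    edges = set()
    for word in sample:
        path = [SRC, *word, SINK]
        vertices.update(word)              # every element name becomes a state
        edges.update(zip(path, path[1:]))  # (src, a1), (a1, a2), ..., (an, sink)
    return vertices, edges
```

Because only consecutive pairs are recorded, the size of the result depends on the number of distinct element names, not on the number or length of the sample strings.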
|
||
|
||
3.1 Learning an Automaton
|
||
Given a sample S, we can learn an automaton G that accepts all words in S by
|
||
means of the algorithm 2T-INF shown in Algorithm 1. Its behavior is illustrated
|
||
in Figure 2(a) on the sample S = {abc, abdd} and in Figure 2(b) on the sample
|
||
S = {bacacdacde, cbacdbacde, abccaadcde}. 2T-INF was introduced by García and
|
||
Vidal [1990], who also proved the following proposition.
|
||
PROPOSITION 7 ([GARCÍA AND VIDAL 1990]). 2T-INF is sound, that is, S ⊆
|
||
L(2T-INF(S)) for each sample S. Moreover, 2T-INF is minimal, that is, for each SOA
|
||
G with S ⊆ L(G), 2T-INF(S) is a subgraph of G and hence L(2T-INF(S)) ⊆ L(G).
|
||
It turns out that 2T-INF is also complete for building a SOA representation of
|
||
a target SORE r, provided that its input sample is representative with regard
|
||
to r.
|
||
Definition 8 (Representative Sample). A word v of length 2 is said to be a
|
||
2-gram of a set of words W if it occurs as a subword in some w ∈ W. A sample
|
||
S is representative of a SORE r if S ⊆ L(r) and the following statements hold:
|
||
(1) for every a ∈ Σ starting a word in L(r) there is a word in S that starts with a;
(2) for every a ∈ Σ ending a word in L(r) there is a word in S that ends with a;
(3) every 2-gram of L(r) is a 2-gram of S.
|
||
If S is not representative of r, then we say that S does not cover r.
|
||
For instance, the sample {a, b, c} is representative for a + b + c but {a, c}
is not since it lacks a word starting with b. Furthermore, the sample
{bacacdacde, cbacdbacde, abccaadcde} is representative for $((b?(a + c))^+ d)^+ e$ but
{bacacdacde, cbacdbacde} is not since it does not contain the 2-gram ab.
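Conditions (1)–(3) of Definition 8 can be checked mechanically once the starting symbols, ending symbols, and 2-grams of L(r) are known. The sketch below assumes these three sets are supplied by the caller (they are written out by hand in the usage example) and that words are strings of single-character element names; it does not check S ⊆ L(r). All function names are illustrative.

```python
def two_grams(words):
    """Return the set of all length-2 subwords occurring in `words`."""
    return {w[i:i + 2] for w in words for i in range(len(w) - 1)}


def is_representative(sample, first, last, grams):
    """Check conditions (1)-(3) of Definition 8.

    `first`, `last` and `grams` describe the target language L(r):
    the symbols that start a word, end a word, and the 2-grams of L(r).
    """
    starts = {w[0] for w in sample if w}
    ends = {w[-1] for w in sample if w}
    return first <= starts and last <= ends and grams <= two_grams(sample)


# Hand-computed description of L(((b?(a + c))+ d)+ e).
FIRST = set("abc")
LAST = {"e"}
GRAMS = {"aa", "ab", "ac", "ad", "ba", "bc", "ca", "cb",
         "cc", "cd", "da", "db", "dc", "de"}
```

For example, `is_representative(["bacacdacde", "cbacdbacde"], FIRST, LAST, GRAMS)` is false because the 2-gram ab is missing from that sample.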
|
||
PROPOSITION 9. If S is a representative sample of SORE r then L(2T-INF(S)) = L(r).
|
||
|
||
PROOF. It is not hard to see that every SORE r can be transformed into an
|
||
equivalent SOA Gr : we take as nodes of Gr all element names occurring in r
|
||
plus the initial state src and the final state sink; for each alphabet symbol a that
starts a word in L(r) we add the edge (src, a) to Gr ; for each alphabet symbol a
that ends a word in L(r) we add an edge (a, sink) to Gr ; and for each alphabet
symbol b that follows an alphabet symbol a in a word in L(r) we add the edge
|
||
|
||
|
||
Fig. 3. A SOA not equivalent to any SORE. It accepts the same language as a(ba)+ .
|
||
|
||
(a, b) to $G_r$. Now reason as follows. Clearly, S ⊆ L(r) = L($G_r$). Hence, 2T-INF(S)
is a subgraph of $G_r$ by Proposition 7. Since S is a representative sample of r,
however, every edge of $G_r$ must also be in 2T-INF(S). As such, 2T-INF(S) = $G_r$ and
hence L(2T-INF(S)) = L($G_r$) = L(r).
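The construction of the SOA for a SORE r used in this proof can be sketched in Python. The encoding of SOREs as nested tuples (`("sym", a)`, `("cat", r, s)`, `("alt", r, s)`, `("plus", r)`, `("opt", r)`) and the function names are assumptions made for illustration. The edge (src, sink) added for expressions that accept the empty word is also an addition of this sketch; the proof text does not spell it out.

```python
SRC, SINK = "src", "sink"


def analyse(r):
    """Return (nullable, first, last, follow) for a SORE given as nested tuples."""
    op = r[0]
    if op == "sym":
        return False, {r[1]}, {r[1]}, set()
    if op in ("opt", "plus"):
        nullable, first, last, follow = analyse(r[1])
        if op == "plus":
            # Iteration lets every last symbol be followed by every first symbol.
            follow = follow | {(a, b) for a in last for b in first}
            return nullable, first, last, follow
        return True, first, last, follow
    n1, f1, l1, fo1 = analyse(r[1])
    n2, f2, l2, fo2 = analyse(r[2])
    if op == "alt":
        return n1 or n2, f1 | f2, l1 | l2, fo1 | fo2
    # op == "cat"
    first = (f1 | f2) if n1 else f1
    last = (l1 | l2) if n2 else l2
    follow = fo1 | fo2 | {(a, b) for a in l1 for b in f2}
    return n1 and n2, first, last, follow


def sore_to_soa(r):
    """Return the edge set of a SOA accepting L(r)."""
    nullable, first, last, follow = analyse(r)
    edges = {(SRC, a) for a in first} | {(a, SINK) for a in last} | follow
    if nullable:
        edges.add((SRC, SINK))  # assumption of this sketch: needed to accept the empty word
    return edges
```

Applied to the tuple encoding of ((b?(a + c))+ d)+ e, the sketch yields the 18 edges that 2T-INF produces for the representative sample {bacacdacde, cbacdbacde, abccaadcde}, which is the equality claimed by Proposition 9 for this example.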
|
||
3.2 From SOA to SORE
|
||
Proposition 9 shows that it is possible to learn a SOA representation of a target
|
||
SORE r, provided that we are given enough data. To transform this SOA into
|
||
a regular expression, an obvious approach would be to use known techniques
|
||
such as the classical state elimination algorithm (refer to, e.g., Hopcroft and
|
||
Ullman [1979]). Unfortunately, as already hinted at by Fernau [2004, 2009]
and as we illustrate shortly, it is very difficult to get concise regular expressions
|
||
from an automaton representation. For instance, the classical state elimination
|
||
algorithm applied to the SOA generated by 2T-INF in Figure 2(b) yields the
|
||
expression:1
|
||
(aa∗ d + (c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d) + (b + aa∗ b + (c +
|
||
aa∗ c)(c + aa∗ c)∗ (b + aa∗ b))(aa∗ b + (c + aa∗ c)(c + aa∗ c)∗
|
||
(b + aa∗ b))∗ (aa∗ d + (c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d)))(aa∗ d +
|
||
(c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d) + (b + aa∗ b + (c + aa∗ c)(c +
|
||
aa∗ c)∗ (b + aa∗ b))(aa∗ b + (c + aa∗ c)(c + aa∗ c)∗ (b + aa∗ b))∗
|
||
|
||
which differs quite a bit from the equivalent SORE

$((b?(a + c))^+ d)^+ e \qquad (\ddagger)$.
|
||
|
||
Actually, results by Ehrenfeucht and Zeiger [1976], Gelade and Neven [2008],
|
||
and Gruber and Holzer [2008] show that it is impossible in general to generate
|
||
concise regular expressions from automata: there are automata, even SOAs as
|
||
generated by 2T-INF, for which the number of occurrences of alphabet symbols in
|
||
the smallest equivalent expression is exponential in the size of the automaton.
|
||
For such automata, a concise regular expression representation hence does not
|
||
exist.
|
||
These results imply that there are SOAs G for which an equivalent SORE
|
||
does not exist (Figure 3 gives a simple example). Note, however, that when
|
||
such a SORE r does exist, its size is always linearly bounded by the number of
|
||
states of G. Indeed, since every alphabet symbol can occur at most once in r, the
|
||
size of r is linearly bounded by the alphabet symbols that it mentions. Since G
|
||
and r are equivalent, these symbols are exactly the states of G (minus src and
|
||
sink). Hence, the SOREs constitute a well-behaved and concisely representable
|
||
subset of the regular languages. It is therefore natural to investigate how to
|
||
1 Transformation computed by JFLAP: www.jflap.org.
|
||
|
||
|
||
|
||
transform a given SOA into an equivalent SORE when such a SORE exists.
|
||
Clearly, the previous example illustrates that the classical state elimination
|
||
algorithm does not suffice for this purpose.
|
||
For that reason, we introduce in this section a novel graph-rewriting approach for transforming SOAs into SOREs. While our approach is related to the
|
||
classical state-elimination algorithm for transforming an arbitrary automaton
|
||
into a regular expression, we do not eliminate states by introducing additional
|
||
edges (thereby duplicating subexpressions) but instead replace sets of states
|
||
by single states (taking care to avoid duplication). In addition, there are two
|
||
rewriting steps that only remove edges.
|
||
Just as for the classical algorithm, the definition of the graph rewrite rules
requires a generalization of SOAs in which internal states are allowed to be
labeled by SOREs (as opposed to element names from Σ). This generalization is
defined as follows. Call two regular expressions r and s alphabet-disjoint if r
and s have no alphabet symbol in common. For example, (a + b)? and $c^+$ are
alphabet-disjoint, whereas (a + b) and $b?c^+$ are not. Call an expression
r proper if it accepts at least one nonempty word (i.e., it is not equivalent to ∅,
nor to ε).
|
||
Definition 10. A generalized Single Occurrence Automaton (generalized
|
||
SOA for short) is a finite graph G = (V, E) such that:
|
||
(1) {src, sink} ⊆ V and all vertices in V − {src, sink} are pairwise alphabet-disjoint proper SOREs; and
|
||
(2) the edge relation E is such that src has only outgoing edges; sink has only
|
||
incoming edges; and every v ∈ V is visited by a walk from src to sink.
|
||
A word $w \in \Sigma^*$ is accepted by G if there is a walk src $r_1 \dots r_m$ sink in G and a
division of w into subwords $w = w_1 \cdots w_m$ such that $w_i \in L(r_i)$, for 1 ≤ i ≤ m.
|
||
Again, we write L(G) for the set of all words accepted by G.
|
||
Figure 7 shows some examples. Clearly, every SOA is also a generalized
|
||
SOA. In what follows, we write PredG (s) for the set of all direct predecessors of
|
||
a SORE s in G, and SuccG (s) for the set of all direct successors of s in G.
|
||
\[
\mathrm{Pred}_G(s) := \{ r \mid (r, s) \in E(G) \}, \qquad
\mathrm{Succ}_G(s) := \{ t \mid (s, t) \in E(G) \}.
\]

Furthermore, we write $\mathrm{Pred}^-_G(s)$ for $\mathrm{Pred}_G(s) - \{s\}$ and similarly $\mathrm{Succ}^-_G(s)$ for
$\mathrm{Succ}_G(s) - \{s\}$. Finally, we write

\[
\mathrm{Pred}^+_G(s) :=
\begin{cases}
\mathrm{Pred}_G(s) \cup \{s\} & \text{if } s = s'^{+} \text{ for some } s' \\
\mathrm{Pred}_G(s) & \text{otherwise}
\end{cases}
\]

\[
\mathrm{Succ}^+_G(s) :=
\begin{cases}
\mathrm{Succ}_G(s) \cup \{s\} & \text{if } s = s'^{+} \text{ for some } s' \\
\mathrm{Succ}_G(s) & \text{otherwise.}
\end{cases}
\]
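These neighborhood sets are straightforward to compute from an edge set. In the sketch below, a generalized SOA is assumed to be a set of edges whose states are either element names (strings) or nested tuples, with `("plus", x)` standing for an iteration x+; this encoding and all function names are illustrative assumptions.

```python
def pred(edges, s):
    """Direct predecessors of state s."""
    return {r for (r, t) in edges if t == s}


def succ(edges, s):
    """Direct successors of state s."""
    return {t for (r, t) in edges if r == s}


def pred_minus(edges, s):
    """Predecessors of s without s itself."""
    return pred(edges, s) - {s}


def succ_minus(edges, s):
    """Successors of s without s itself."""
    return succ(edges, s) - {s}


def is_iteration(state):
    """True if the state is a SORE of the form s'+ (encoded as ("plus", s'))."""
    return isinstance(state, tuple) and state[0] == "plus"


def pred_plus(edges, s):
    """Extended predecessors: s counts as its own predecessor if s = s'+."""
    return (pred(edges, s) | {s}) if is_iteration(s) else pred(edges, s)


def succ_plus(edges, s):
    """Extended successors: s counts as its own successor if s = s'+."""
    return (succ(edges, s) | {s}) if is_iteration(s) else succ(edges, s)
```

The extended sets treat a state of the form s'+ as if it still carried a self-loop, even when the graph itself has no such edge.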
|
||
|
||
Rewrite rules. Our system of rewrite rules consists of the seven rules shown
|
||
in Figures 4–6: one rule to introduce disjunction (r + s), four rules to introduce
|
||
|
||
|
||
Fig. 4. Rewrite rules part 1. In the illustrations, P is the set $\mathrm{Pred}_G(r) - \{r, s\}$. S is the set $\mathrm{Succ}_G(s) - \{r, s\}$. The gray loops on r and s indicate that $r \in \mathrm{Succ}^+_G(r)$ and $s \in \mathrm{Succ}^+_G(s)$, respectively.
|
||
|
||
|
||
Fig. 5. Rewrite rules part 2. In the illustrations, P is the set $\mathrm{Pred}_G(r) - \{r, s\}$. S is the set $\mathrm{Succ}_G(s) - \{r, s\}$. The gray loops on r and s indicate that $r \in \mathrm{Succ}^+_G(r)$ and $s \in \mathrm{Succ}^+_G(s)$, respectively.
|
||
|
||
concatenation (r . s, r? . s, r . s?, and r? . s?), one rule to introduce iteration ($r^+$),
and one rule to introduce optionals (r?). At the basis of the first five rules lies
|
||
the contraction of two states r and s into a single new state t, which is defined
|
||
as follows.
|
||
Definition 11 (State Contraction). Let G be a generalized SOA; let r and s
|
||
be states in G; and let t be a state not in G. The contraction of r and s into t is
|
||
the generalized SOA G[r, s ⇒ t] obtained from G as follows:
|
||
|
||
|
||
Fig. 6. Rewrite rules part 3. In the illustrations, P is the set $\mathrm{Pred}_G(r) - \{r, s\}$. S is the set $\mathrm{Succ}_G(s) - \{r, s\}$. Note in particular that the rule OPTIONAL r? can only be applied when G contains only one node besides src and sink.
|
||
|
||
(1) Add t as a new state to G;
|
||
(2) make every v ∈ PredG (r) − {r, s} a predecessor of t;
|
||
(3) make every w ∈ SuccG (s) − {r, s} a successor of t;
|
||
(4) add a loop t → t if r ∈ SuccG (s); and
|
||
(5) remove r, s and all of their incoming and outgoing edges.
|
||
Note that state contraction is not symmetric.
|
||
To illustrate, the contraction G[a, c ⇒ a + c] of the generalized SOA G in
|
||
Figure 7(a) is shown in Figure 7(b). Similarly, the contraction G[b, a + c ⇒
|
||
b? .(a + c)] of the generalized SOA G in Figure 7(b) is shown in Figure 7(c). Note
|
||
that if r = s, then G[r, s ⇒ t] is simply a substitution of r by the new state t.
|
||
To simplify notation, we simply write G[r ⇒ t] for such contractions in what
|
||
follows.
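State contraction can be sketched in a few lines of Python over an edge-set representation. The sketch takes the predecessors of t from r and the successors of t from s, in line with the sets P and S named in the captions of Figures 4–6; the function name and the representation are illustrative assumptions.

```python
def contract(edges, r, s, t):
    """Return the edge set of G[r, s => t] following Definition 11."""
    preds = {v for (v, w) in edges if w == r} - {r, s}   # Pred_G(r) - {r, s}
    succs = {w for (v, w) in edges if v == s} - {r, s}   # Succ_G(s) - {r, s}
    loop = (s, r) in edges                               # r in Succ_G(s)
    # Drop r, s and every edge that touches them.
    kept = {(v, w) for (v, w) in edges if v not in (r, s) and w not in (r, s)}
    new = {(v, t) for v in preds} | {(t, w) for w in succs}
    if loop:
        new.add((t, t))
    return kept | new
```

On the SOA of Figure 2(b), `contract(E, "a", "c", "a+c")` leaves ten edges, including a loop on the new state because a is a successor of c. Swapping the roles of r and s can change the result, which is why contraction is not symmetric.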
|
||
In addition to contraction, the rewrite rules also use the following
|
||
operation.
|
||
Definition 12. If G is a generalized SOA and r, s are states in G, then we
write $G \setminus (r, s)$ to denote the generalized SOA obtained from G by removing the
edge from r to s, if present.
In what follows, we write $G \to H$ to indicate that G rewrites to H in a single
step according to the rewrite rules in Figures 4–6, and $G \to^* H$ to indicate that
G rewrites to H in zero or more steps.
|
||
The following proposition shows that the rewrite rules are sound.
|
||
PROPOSITION 13. If G is a generalized SOA and $G \to H$ then H is also a
generalized SOA and L(G) = L(H).
|
||
|
||
|
||
PROOF. First observe that, since all states in a generalized SOA are pairwise
|
||
alphabet-disjoint proper SOREs, the new states r + s; r . s; r? . s; r . s?; r? . s?; $r^+$;
|
||
and r? introduced by the rewrite rules in Figures 4–6 must themselves be proper
|
||
SOREs alphabet-disjoint with the remaining states. As such, all states in H
|
||
are pairwise alphabet-disjoint proper SOREs. To show that H is a generalized
|
||
SOA, it hence remains to show that every state in H participates in a walk
|
||
from src to sink. Hereto, we distinguish the following three cases.
|
||
—H = G[r, s ⇒ t] for some t. Since G is a generalized SOA, r and s
participate in a walk from src to sink. In particular, there is a walk from src
to r in G, and a walk from s to sink. Then, by definition of state contraction,
there is a walk from src to t and from t to sink in H, that is, t participates in
a walk from src to sink in H.
—$H = G[r \Rightarrow r^+] \setminus (r^+, r^+)$. Then, by definition of state contraction and since
r participates in a walk from src to sink in G, $r^+$ must participate in a
walk from src to sink in $G[r \Rightarrow r^+]$. This walk can always be transformed
into a walk from src to sink in H by removing the edge $(r^+, r^+)$ should it
occur.
—$H = G[r \Rightarrow r?] \setminus (\mathit{src}, \mathit{sink})$. Then, by definition of state contraction and since
r participates in a walk from src to sink in G, r? must participate in a walk
from src to sink in G[r ⇒ r?]. Since the edge (src, sink) cannot occur in this
walk (recall that src has no incoming edges and sink has no outgoing edges),
r? also participates in a walk from src to sink in H.
|
||
To see that L(G) = L(H) we reason by a case analysis on the rewrite rule used
|
||
to transform G into H. For economy of space, we only illustrate this reasoning
|
||
for DISJUNCTION r + s; the other cases are similar.
|
||
So, suppose that G was rewritten into H by DISJUNCTION r + s, that is, H =
|
||
G[r, s ⇒ r+s]. Then r and s have the same (extended) predecessor and successor
|
||
set. From this, it follows that the following statements are equivalent.
|
||
(1) $s \in \mathrm{Succ}_G(r)$;
(2) $r \in \mathrm{Succ}_G(s)$;
(3) $s \in \mathrm{Succ}^+_G(s)$;
(4) $r \in \mathrm{Succ}^+_G(r)$.
|
||
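For instance, $s \in \mathrm{Succ}_G(r) \Leftrightarrow r \in \mathrm{Succ}_G(s)$ since:

\begin{align*}
s \in \mathrm{Succ}_G(r)
  &\Leftrightarrow s \in \mathrm{Succ}_G(r) \cup \{r\} && \text{since } r \neq s \\
  &\Leftrightarrow s \in \mathrm{Succ}^+_G(r) && \text{by definition of } \mathrm{Succ}^+_G(r) \\
  &\Leftrightarrow s \in \mathrm{Succ}^+_G(s) && \text{since } \mathrm{Succ}^+_G(r) = \mathrm{Succ}^+_G(s) \\
  &\Leftrightarrow s \in \mathrm{Pred}^+_G(s) && \text{by definition of } \mathrm{Succ}^+_G(s) \text{ and } \mathrm{Pred}^+_G(s) \\
  &\Leftrightarrow s \in \mathrm{Pred}^+_G(r) && \text{since } \mathrm{Pred}^+_G(r) = \mathrm{Pred}^+_G(s) \\
  &\Leftrightarrow s \in \mathrm{Pred}_G(r) \cup \{r\} && \text{by definition of } \mathrm{Pred}^+_G(r) \\
  &\Leftrightarrow s \in \mathrm{Pred}_G(r) && \text{since } r \neq s \\
  &\Leftrightarrow r \in \mathrm{Succ}_G(s) && \text{by definition of } \mathrm{Pred}_G(r) \text{ and } \mathrm{Succ}_G(s)
\end{align*}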
|
||
|
||
|
||
|
||
The other equivalences can be similarly obtained. From these equivalences,
it follows that G must take one of the two forms illustrated for rewrite rule
DISJUNCTION r + s in Figure 4. In both cases, the corresponding H is also shown.
Now suppose that $w = w_1 \cdots w_m \in \Sigma^*$ is recognized by the walk src, $t_1, \dots, t_m$, sink
in G with $w_i \in L(t_i)$ for 1 ≤ i ≤ m. Let the sequence src, $t'_1, \dots, t'_m$, sink
be obtained from src, $t_1, \dots, t_m$, sink by replacing every occurrence of r and s by
r + s. By inspection of the illustrations for rule DISJUNCTION r + s in Figure 4 it
is not difficult to see that src, $t'_1, \dots, t'_m$, sink is a walk in H. Moreover, $w_i \in L(t'_i)$
by construction for 1 ≤ i ≤ m. Therefore, w ∈ L(H) and hence L(G) ⊆ L(H).
Conversely, suppose that $w = w_1 \cdots w_m \in \Sigma^*$ is recognized by src, $t_1, \dots, t_m$, sink
in H with $w_i \in L(t_i)$ for 1 ≤ i ≤ m. Determine $t'_i$ as follows:

\[
t'_i =
\begin{cases}
t_i & \text{if } t_i \neq r + s \\
r & \text{if } t_i = r + s \text{ and } w_i \in L(r) \\
s & \text{if } t_i = r + s \text{ and } w_i \in L(s)
\end{cases}
\]

By inspection of the illustrations for rule DISJUNCTION r + s in Figure 4 it is
not difficult to see that src, $t'_1, \dots, t'_m$, sink is a walk in G. Moreover, $w_i \in L(t'_i)$
for 1 ≤ i ≤ m. Therefore w ∈ L(G) and hence L(H) ⊆ L(G). As such, L(G) =
L(H).
|
||
Since each rewrite rule either contracts two states into a single state or
|
||
removes an edge from G, the size of H is always smaller than G. Therefore, we
|
||
have the next proposition.
|
||
PROPOSITION 14. The system of rewrite rules in Figures 4–6 is terminating:
|
||
there is no infinite sequence of rewrite steps $G \to H \to I \to \cdots$
|
||
Our algorithm REWRITE, shown in Algorithm 2, then operates as follows. First,
|
||
it checks whether the input SOA G corresponds to the empty language (∅) or
|
||
the empty word (ε) in lines 1–5. If so, it returns the corresponding regular
|
||
expression. Otherwise, it rewrites G until no further rules apply. It then checks
|
||
whether the resulting generalized SOA is final.
|
||
Definition 15. A generalized SOA G is final if E(G) = {(src, r), (r, sink)}
|
||
with r distinct from src and sink. In other words, G is final if it is a chain
|
||
consisting of the source, an arbitrary regular expression, and the sink.
|
||
If the resulting generalized SOA is indeed final, then clearly L(G) = L(r),
|
||
and r is returned as result. If the resulting generalized SOA is not final, then
|
||
G is not equivalent to a SORE (as we formally show further on), and REWRITE
|
||
fails. To illustrate, Figure 7 shows an example run of REWRITE on the example
|
||
SOA from Figure 2(b).
|
||
THEOREM 16. On input SOA G, REWRITE fails if and only if G is not equivalent
to a SORE. Otherwise, REWRITE returns a SORE equivalent to G. Moreover,
REWRITE operates in time $O(n^5)$ where n is the number of states in G.
Note that the complexity $O(n^5)$ is reasonable since when we apply REWRITE to
|
||
the result of 2T-INF on a sample S, n corresponds to the (typically small) number
|
||
|
||
|
||
Algorithm 2. REWRITE
Input: a SOA G
Output: a SORE r such that L(r) = L(G)
1: if sink is not reachable from src in G then
2:     return ∅
3: else if E(G) = {(src, sink)} then
4:     return ε
5: else
6:     while a rewrite rule from Figures 4–6 can be applied do
7:         perform the rewrite rule on G
8:     end while
9:     if G is final then
10:        return the corresponding regular expression
11:    else
12:        fail
13:    end if
14: end if
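The control flow of Algorithm 2 can be sketched in Python as follows. Since the rewrite rules themselves are given only in Figures 4–6, the sketch takes them as a hypothetical list `rules` of functions that return a rewritten edge set or `None` when they do not apply. Only ITERATION r+ is written out, under the assumption (taken from the complexity discussion and the proof of Proposition 20) that it applies to any state with a self-loop; the names `rewrite`, `iteration_rule`, and `NotASore` are illustrative.

```python
SRC, SINK = "src", "sink"


class NotASore(Exception):
    """Raised when rewriting gets stuck in a nonfinal automaton."""


def reachable(edges, start, goal):
    """Depth-first search for a walk from start to goal."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        if v == goal:
            return True
        for (a, b) in edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return False


def iteration_rule(edges):
    """ITERATION r+: rename a state r with a self-loop to ("plus", r) and drop the loop."""
    for (r, s) in edges:
        if r == s:
            t = ("plus", r)
            return {(t if v == r else v, t if w == r else w)
                    for (v, w) in edges if (v, w) != (r, r)}
    return None


def rewrite(edges, rules):
    """Skeleton of Algorithm 2 over an edge-set representation."""
    if not reachable(edges, SRC, SINK):
        return "∅"
    if edges == {(SRC, SINK)}:
        return "ε"
    applied = True
    while applied:                      # lines 6-8
        applied = False
        for rule in rules:
            result = rule(edges)
            if result is not None:
                edges, applied = result, True
                break
    states = {v for e in edges for v in e} - {SRC, SINK}
    if len(states) == 1:                # lines 9-13
        (r,) = states
        if edges == {(SRC, r), (r, SINK)}:
            return r
    raise NotASore("rewriting got stuck in a nonfinal automaton")
```

Passing the rules as a list keeps the driver independent of the order in which they are tried, which matches the observation that any maximal sequence of rewrite steps ends in a final automaton when one exists.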
|
||
|
||
of distinct element names occurring in S, not the total number or total length
|
||
of words in S.
|
||
The remainder of this section is devoted to the proof of Theorem 16, which
|
||
is divided into three steps. First, we show that REWRITE is sound.
|
||
PROPOSITION 17. If REWRITE(G) does not fail then it returns a SORE equivalent to G, for any SOA G.
|
||
PROOF. We distinguish three cases.
|
||
|
||
(1) If sink is not reachable from src then REWRITE(G) = ∅ (clearly a SORE) and
|
||
L(G) = ∅ = L(∅), as desired.
|
||
(2) If E(G) = {(src, sink)} then REWRITE(G) = ε (again clearly a SORE), and
|
||
L(G) = {ε} = L(ε), as desired.
|
||
(3) Otherwise, G is rewritten into a final generalized SOA H with E(H) =
|
||
{(src, t), (t, sink)} (t distinct from src and sink) and REWRITE(G) = t. In
|
||
particular, t is a SORE. By Proposition 13, L(G) = L(H) and thus, since
|
||
E(H) = {(src, t), (t, sink)}, L(G) = L(H) = L(t) = L(REWRITE(G)), as desired.
|
||
Next, we show that REWRITE has the claimed complexity.
|
||
PROPOSITION 18. REWRITE operates in time $O(n^5)$, where n is the number of
states of its input G.
PROOF. We assume that checking whether there is an edge from state r
to state s can be done in constant time (for instance, using an adjacency matrix
representation). To see that REWRITE runs in time $O(n^5)$ under this assumption,
let us check that lines 1–4, lines 6–7, and lines 8–11 all run in $O(n^5)$.
(Lines 1–4). Since G has at most $n^2$ edges, checking whether sink is reachable
from src can be done in time $O(n^2)$ using depth-first search. Moreover, checking
whether E(G) = {(src, sink)} can also be done in time $O(n^2)$.
|
||
|
||
|
||
Fig. 7. An execution of REWRITE on the example automaton in Figure 2(b). Step (1) applies DISJUNCTION r + s with r = a and s = c. Step (2) applies CONCATENATION r? . s with r = b and s = a + c. Step (3) applies ITERATION $r^+$ with $r = b? \cdot (a + c)$. Step (4) applies CONCATENATION r . s with $r = (b? \cdot (a + c))^+$ and s = d. Step (5) applies ITERATION $r^+$ with $r = (b? \cdot (a + c))^+ \cdot d$. One more application of CONCATENATION r . s with $r = ((b? \cdot (a + c))^+ \cdot d)^+$ and s = e (not shown) leads to the resulting expression $((b? \cdot (a + c))^+ \cdot d)^+ \cdot e$.
|
||
|
||
(Lines 6–7). Suppose that $\vec{G} = G_1, G_2, \dots, G_k$ is the sequence of generalized
SOAs produced by lines 6–7 when rewriting $G = G_1$ until no further rewrite
rule applies. Since rewrite rules never introduce new states without also removing
a state, every $G_i$ has at most n states. Now reason as follows.
—The rule for optionals can be applied at most once in $\vec{G}$ since the automaton
that it returns is always final, and since no rewrite rule applies to a final
generalized SOA. Checking the preconditions of the rule for optionals can be
done in time $O(n^2)$, and its action can be performed in time O(n). As such, the
total time spent in $\vec{G}$ on applying the rewrite rule for optionals is bounded
by $O(n^2)$.
—Since the rewrite rules for disjunction and concatenation contract two states
into a single one, these rewrite rules can be applied at most n times in $\vec{G}$.
Since all of their preconditions can be checked in time $O(n^4)$ (by iterating
over all pairs of states r and s in the current automaton $G_i$ and comparing
Pred(r), Pred(s), Succ(r), and Succ(s) as desired) and since state contraction
can be done in time O(n), the total time spent in $\vec{G}$ on the rewrite rules for
disjunction and concatenation is bounded by $O(n \times n^4) = O(n^5)$.
—Since the rule for iteration removes the loop of the state to which it is applied,
and since each generalized SOA contains at most n loops, there can be at most
n consecutive applications of this rule before another rewrite rule is applied.
By the preceding remarks, there are at most n applications of the other
rewrite rules, so the rewrite rule for iteration can be applied at most $n^2$ times
in $\vec{G}$. Since its precondition can be checked in constant time, and since its
action can be done in time O(n), the total time spent in $\vec{G}$ on the rewrite rule
for iteration is bounded by $O(n^2 \times n) = O(n^3)$.
(Lines 8–11). Finally, checking whether a generalized SOA is final and extracting
the corresponding regular expression can be done in time $O(n^2)$.
In summary, lines 1–4 run in time $O(n^2)$, lines 6–7 run in time $O(n^5)$, and
lines 8–11 run in time $O(n^2)$, yielding a total running time of $O(n^5)$.
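As a compact restatement of the bounds derived in this proof (the grouping of terms is ours, not the paper's), the total running time can be written as a single sum in which the contraction rules dominate:

```latex
\[
\underbrace{O(n^2)}_{\text{lines 1--4}}
+ \underbrace{O(n^2)}_{\text{optional}}
+ \underbrace{O(n \cdot n^4)}_{\text{disjunction, concatenation}}
+ \underbrace{O(n^2 \cdot n)}_{\text{iteration}}
+ \underbrace{O(n^2)}_{\text{final check}}
= O(n^5).
\]
```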
|
||
|
||
|
||
Finally, we show that REWRITE(G) fails if and only if G is not equivalent
|
||
to a SORE, or equivalently, that REWRITE(G) does not fail if, and only if, G is
|
||
equivalent to a SORE. This is actually the most involved part of the proof of
|
||
Theorem 16. Proposition 17 already shows that if REWRITE(G) does not fail, then
|
||
G is equivalent to a SORE. Hence, it remains to show the next proposition.
|
||
PROPOSITION 19. If SOA G is equivalent to a SORE, then REWRITE(G) does not fail.
|
||
|
||
Essentially, we prove this proposition in two steps. Call a generalized SOA G
proper if L(G) ≠ ∅ and L(G) ≠ {ε}.
|
||
(1) We first show that for any proper SOA G equivalent to a SORE there exists
|
||
a sequence of rewrite steps that ends in a final automaton (Corollary 46).
|
||
(2) In addition, we show that if proper G can be rewritten into a final automaton
|
||
by a particular sequence of rewrite steps, then any sequence of rewrite steps
|
||
on G ends in a final automaton (Corollary 54).
|
||
As such, REWRITE(G) cannot fail when G is equivalent to a SORE: either G is
|
||
not proper, in which case lines 1–4 of Algorithm 2 return a valid expression, or
|
||
G is proper and will hence be rewritten into a final automaton, in which case
|
||
line 9 returns a valid expression. The details may be found in Appendix A.
|
||
3.3 Discussion
|
||
It should be noted that while the result of REWRITE is always a SORE, this
|
||
SORE need not be easy to read (depending on the order of rewriting). For
|
||
instance, it is possible for REWRITE to generate an expression r .(s? . t?)?. Clearly,
|
||
the optional around (s? . t?) is redundant. Removing it leads to the simpler
|
||
r .(s? . t?). For presentation to the user, it is therefore advisable to postprocess
|
||
the result of REWRITE (and its variations in Section 4) using a regular expression
|
||
simplification algorithm.
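One such simplification, removing an optional around a subexpression that already accepts the empty word, can be sketched as follows. The paper does not prescribe a particular simplification algorithm, so this is only an illustration of the single rule mentioned above; SOREs are assumed to be nested tuples (`("sym", a)`, `("cat", r, s)`, `("alt", r, s)`, `("plus", r)`, `("opt", r)`), and the function names are hypothetical.

```python
def nullable(r):
    """Return True if the expression accepts the empty word."""
    op = r[0]
    if op == "sym":
        return False
    if op == "opt":
        return True
    if op == "plus":
        return nullable(r[1])
    if op == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # cat


def simplify(r):
    """Drop every optional whose operand is already nullable."""
    op = r[0]
    if op == "sym":
        return r
    if op in ("opt", "plus"):
        inner = simplify(r[1])
        if op == "opt" and nullable(inner):
            return inner  # (x)? is equivalent to x when x accepts the empty word
        return (op, inner)
    return (op, simplify(r[1]), simplify(r[2]))
```

On the tuple encoding of r . (s? . t?)?, `simplify` returns the encoding of r . (s? . t?), and the language is unchanged because s? . t? already accepts the empty word.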
|
||
4. DEALING WITH MISSING DATA
|
||
The results of Section 3 suggest the following method to infer a SORE from a
|
||
given sample S.
|
||
(1) First, use 2T-INF to learn from S an automaton representation G of the
|
||
target SORE r.
|
||
(2) Next, convert G into a SORE using REWRITE.
|
||
If S is a representative sample of r then G is equivalent to r by Proposition 9.
|
||
Therefore, REWRITE(G) does not fail by Theorem 16, and hence REWRITE(G) is
|
||
equivalent to r.
|
||
Unfortunately, real-world samples are rarely representative. For instance,
|
||
for target $r = (a_1 + \cdots + a_n)^+$ and increasing values of n, it is increasingly unlikely
that a sample bears witness to each of the $n^2$ 2-grams needed to represent r.
|
||
On such nonrepresentative samples, 2T-INF will construct an automaton for
|
||
which L(G) is a strict subset of L(r). In particular, this automaton need not be
|
||
equivalent to a SORE, and REWRITE(G) can fail. Figure 8 shows an example.
|
||
|
||
|
||
Fig. 8. The SOA generated by 2T-INF for the nonrepresentative sample S = {bacacdacde,
|
||
abccaadcde}. The only rewrite rules that can be applied are ITERATION a+ and ITERATION c+ , after which REWRITE gets stuck in a nonfinal automaton and fails.
|
||
|
||
Fig. 9. Repair rules.
|
||
|
||
For that reason, we present in this section two modifications of REWRITE
|
||
that “repair” G when rewriting gets stuck in a nonfinal automaton. The first
|
||
modification, RWR, picks a single repair when rewriting gets stuck, independent
|
||
of how the repair affects G. The second modification, RWR², in contrast, considers
|
||
multiple repair strategies and selects the one that extends G in a minimal way.
|
||
The repair rules used by both algorithms are shown in Figure 9. After a repair
|
||
|
||
|
||
Algorithm 3. RWR
Input: a SOA G
Output: a SORE r such that L(G) ⊆ L(r) if G is not equivalent to a SORE, and L(G) = L(r) otherwise.
1: if sink is not reachable from src in G then
2:     return ∅
3: else if E(G) = {(src, sink)} then
4:     return ε
5: else
6:     while G is not final do
7:         if a rewrite rule from Figures 4–6 can be applied then
8:             apply the rewrite rule on G
9:         else
10:            apply a repair rule from Figure 9
11:        end if
12:    end while
13:    return the corresponding regular expression r
14: end if
|
||
|
||
rule is applied, the automaton necessarily satisfies the precondition of the
|
||
corresponding rewrite rule. Now note the following.
|
||
PROPOSITION 20. Let G be a proper generalized SOA. If G is not final and no
|
||
rewrite rule applies to G, then at least one of the repair rules in Figure 9 applies
|
||
to G.
|
||
PROOF. Since G is proper, it recognizes at least one nonempty word. Clearly,
|
||
this can only happen when src has a successor r distinct from sink. We distinguish two cases.
|
||
—Either r has a successor s distinct from src, sink, and r. Clearly, REPAIR r? . s?
|
||
is then applicable to G.
|
||
—If r does not have such a successor s, then we claim that src has another
|
||
successor t, distinct from src, sink, and r. Indeed, suppose for the purpose
|
||
of contradiction that no such successor exists. Then, since every state in G
|
||
participates in a walk from src to sink, either E(G) = {(src, r), (r, sink)}, or
|
||
E(G) = {(src, r), (r, r), (r, sink)}. In the first case G is final; in the second we
can rewrite G using ITERATION $r^+$. Both cases contradict our assumptions. As such,
|
||
the claimed t exists. Then, since src ∈ PredG (r) ∩ PredG (t), REPAIR r + t is
|
||
applicable to G.
|
||
As such, we can always apply a repair rule if rewriting gets stuck in a
|
||
nonfinal automaton, after which rewriting can continue.
|
||
4.1 A Greedy Approach: RWR
|
||
An outline of RWR (short for REWRITE with REPAIRS) is shown in Algorithm 3. Like
|
||
REWRITE, it first checks whether its input G is equivalent to ∅ or ε. Otherwise,
|
||
G is rewritten using the rewrite rules in Figures 4–6 until a final automaton is
|
||
reached, arbitrarily selecting a repair rule when rewriting gets stuck. (In our
|
||
implementation we prefer repairs that make small extensions to the language
|
||
of the automaton over repairs that make larger extensions. In particular, we
|
||
first check whether there are r and s for which REPAIR r . s? can be applied. Then
|
||
we check whether there are r and s for which REPAIR r? . s can be applied. Next,
|
||
we check for REPAIR r + s and finally for REPAIR r? . s?.)
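The greedy control flow of RWR, including the repair preference just described, can be sketched in Python as follows. This is only an illustration under stated assumptions: the callbacks `is_final`, `try_rewrite`, and `try_repair` are hypothetical stand-ins for the rewrite rules of Figures 4–6 and the repair rules of Figure 9, which are not implemented here, and the rule labels are plain strings chosen for this sketch.

```python
# Preference order from the text: smaller language extensions first.
REPAIR_PREFERENCE = ["r . s?", "r? . s", "r + s", "r? . s?"]


def rwr(G, is_final, try_rewrite, try_repair):
    """Greedy rewrite-with-repairs loop (sketch).

    try_rewrite(G) returns a rewritten automaton or None if no rule applies.
    try_repair(G, rule) returns a repaired automaton or None if `rule`
    is not applicable. Both callbacks are assumptions of this sketch.
    """
    while not is_final(G):
        H = try_rewrite(G)
        if H is None:
            # Rewriting is stuck: take the first applicable repair.
            for rule in REPAIR_PREFERENCE:
                H = try_repair(G, rule)
                if H is not None:
                    break
        if H is None:
            # Proposition 20 rules this out for proper generalized SOAs.
            raise RuntimeError("no rewrite or repair rule applies")
        G = H
    return G
```

The loop never backtracks: once a repair is chosen, the alternatives are discarded.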
|
||
Since the repair rules add edges to G, thereby increasing L(G), we may
|
||
conclude the following theorem.
|
||
THEOREM 21. For a SOA G, RWR always produces a SORE r with L(G) ⊆
|
||
L(r). Moreover, if G is equivalent to a SORE, then L(G) = L(r).
|
||
(The second statement follows by Theorem 16.) Combined with Proposition 9,
|
||
we hence obtain the next corollary.
|
||
COROLLARY 22. Let M be the composition of 2T-INF with RWR, that is, M(S) :=
RWR(2T-INF(S)). Then M learns the class of SOREs from positive data.
|
||
|
||
4.2 Exploring the Search Space: RWR2
|
||
When rewriting gets stuck, RWR arbitrarily selects a repair rule (perhaps based
|
||
on some ordering of the rules as in our implementation), and discards the others. It should be clear, however, that when different repair rules are applicable,
|
||
one rule may have a smaller impact on the language of the automaton than
|
||
another. For that reason we present in this section a different modification
|
||
of REWRITE that, in contrast to RWR, tries the “best” repair rules when there
|
||
are several candidates. Here, the “best” repair rules are those that add the
|
||
least number of words to the language. Since an automaton defines an infinite
|
||
language in general, it is of course impossible to take all added words into
|
||
account. We therefore only consider the words up to a length n, where n is twice
|
||
the number of alphabet symbols in the automaton. Formally, for a language L,
|
||
let |L≤n| denote the number of words in L of length at most n. Moreover, say
|
||
that generalized SOA H is a repair of generalized SOA G if H is obtained by
|
||
applying a repair rule on G. Then the repairs of the current automaton G are
|
||
ordered according to increasing values of | L(H)≤n|, and the best (i.e., first)
|
||
among them are further investigated.
|
||
The resulting algorithm, called RWR2 (an abbreviation of REWRITE with
|
||
best RANKED REPAIRS) is shown in Algorithm 4. Like REWRITE, it first checks
|
||
whether its input G is equivalent to ∅ or ε. Otherwise, RWR2 uses RWR2 -AUX to
|
||
Algorithm 4. RWR2
Input: SOA G
Output: a SORE r such that L(G) ⊆ L(r) if G is not equivalent to a SORE, and L(G) = L(r) otherwise.
1: if sink is not reachable from src in G then
2:   return ∅
3: else if E(G) = {(src, sink)} then
4:   return ε
5: else
6:   initialize the final automaton Hopt to recognize Σ(G)∗
7:   return the SORE corresponding to the final automaton computed by RWR2-AUX(G, Hopt)
8: end if
|
||
Algorithm 5. RWR2-AUX
Input: generalized SOAs G and Hopt
Output: final generalized SOA I such that L(G) ⊆ L(I) if G is not equivalent to a SORE, and L(G) = L(I) otherwise.
1: while a rewrite rule from Figures 4–6 can be applied to G do
2:   perform the rewrite rule on G
3: end while
4: if G is final then
5:   return G
6: else
7:   compute the set R of all possible repairs H of G
8:   sort R in increasing order by |L(H)≤n|
9:   for each of the min(ℓ, |R|) best repairs H do
10:    if |L(H)≤n| < |L(Hopt)≤n| then
11:      recursively compute H′ := RWR2-AUX(H, Hopt)
12:      set Hopt := H′ if |L(H′)≤n| < |L(Hopt)≤n|
13:    end if
14:  end for
15:  return Hopt
16: end if
|
||
|
||
recursively rewrite and repair G until a final automaton is reached. During
this recursion, Hopt is the best final generalized SOA found so far. Initially, on
line 6 of RWR2, Hopt is set to the final generalized SOA that accepts all words
over alphabet symbols mentioned in G. RWR2-AUX then rewrites G in lines 1–3
until no more rewrite rule is applicable. If the resulting G is final then it is
returned. Otherwise, RWR2-AUX computes in line 7 all possible repairs H of G
and orders them in line 8 according to increasing values of |L(H)≤n|. The algorithm
then recursively calls itself on the best ranked repairs in lines 9–14. The test in
line 10 is an optimization: repairs only enlarge the language, so if the current
repair is already no better than the best final generalized SOA Hopt computed so
far in terms of language size, then further rewriting and repairing cannot yield a
final generalized SOA that is better than Hopt. Lines 11 and 12 update Hopt when
appropriate. Finally, Hopt is returned.
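The recursion of RWR2-AUX can be sketched in Python as a bounded best-first search. The sketch is not the prototype implementation: the automaton operations are passed in as hypothetical callbacks (`rewrite_fully`, `is_final`, `repairs`, `size`), where `size` is assumed to return the bounded language size |L(H)≤n|, and the parameter name `ell` for the number of repairs explored per level is an assumption of this sketch.

```python
def rwr2_aux(G, H_opt, ell, rewrite_fully, is_final, repairs, size):
    """Explore the `ell` best-ranked repairs at each stuck point (sketch).

    rewrite_fully(G): apply rewrite rules until none is applicable.
    repairs(G):       all automata obtainable by one repair rule.
    size(H):          bounded language size used for ranking.
    All four callbacks are assumptions, not part of the original listing.
    """
    G = rewrite_fully(G)
    if is_final(G):
        return G
    ranked = sorted(repairs(G), key=size)
    for H in ranked[:ell]:
        # Pruning: a repair that is already no smaller than the best
        # final automaton found so far cannot lead to a better result.
        if size(H) < size(H_opt):
            H_prime = rwr2_aux(H, H_opt, ell,
                               rewrite_fully, is_final, repairs, size)
            if size(H_prime) < size(H_opt):
                H_opt = H_prime
    return H_opt
```

With `ell = 1` the search degenerates to a greedy choice of the locally smallest repair; larger values trade running time for the chance of finding a repair sequence whose final language is smaller.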
|
||
Given its definition, it is clear that RWR2 results in regular expressions with
a smaller language size for increasing values of ℓ, of course at the cost of
increased computation time. In the experiments (Section 7.2) the trade-off between
precision and computation time of RWR and RWR2, for increasing values
of ℓ, is investigated in more detail.
|
||
4.3 Efficiently Computing the Language Size
|
||
During its execution, RWR2 repeatedly needs to compute the language size of
|
||
the possible repairs. This computation can actually be done quite efficiently
|
||
for SOAs, as we show next. Of course, in general RWR2 needs to compute the
|
||
language size also for generalized SOAs, not just ordinary SOAs. Our implementation first expands such generalized SOAs into an equivalent SOA using
|
||
the Glushkov construction (similar to the ideas of the proof of Proposition 45
|
||
in the online appendix that can be accessed in the ACM Digital Library), and
|
||
then invokes the language size computation procedure explained next.
|
||
Let |L=m| denote the number of words in L of length exactly m. Let G be a
SOA and assume that V(G) − {src, sink} = {a1, . . . , an}. Then consider the n × n
matrix D where, for i, j ∈ {1, . . . , n},

\[ D[i,j] = \begin{cases} 1 & \text{if } (a_i, a_j) \in E; \text{ and} \\ 0 & \text{otherwise.} \end{cases} \]

In addition, define the 1 × n and n × 1 matrices I and F, respectively, as follows:
for i, j ∈ {1, . . . , n},

\[ I[1,j] = \begin{cases} 1 & \text{if } (\mathit{src}, a_j) \in E; \text{ and} \\ 0 & \text{otherwise;} \end{cases} \]

and

\[ F[i,1] = \begin{cases} 1 & \text{if } (a_i, \mathit{sink}) \in E; \text{ and} \\ 0 & \text{otherwise.} \end{cases} \]

The following lemma is straightforward to prove by induction on m using
the fact that each walk from src to sink in G uniquely determines an accepted
word. Let D^m denote the m-fold product of D with itself, with D^0 the unit matrix.

LEMMA 23. Let m > 0 and let G be a SOA. Then

\[ |L(G)^{=m}| = I \cdot D^{m-1} \cdot F. \]

Since for m = 0 we simply have |L(G)=m| = 1 if (src, sink) ∈ E, and
|L(G)=m| = 0 otherwise, and since

\[ |L(G)^{\le n}| = \sum_{m=0}^{n} |L(G)^{=m}|, \]

we can determine |L(G)≤n| by iteratively computing the matrices D^1 to D^{n−1} and
applying Lemma 23. This immediately gives the following corollary.

COROLLARY 24. For each n > 0 and SOA G, |L(G)≤n| can be computed in
time O(n|G|^3).
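The counting procedure of Lemma 23 can be illustrated with a short Python sketch. The edge-set representation of a SOA and the function name `count_words_up_to` are assumptions made for this illustration. Instead of materializing the powers of D, the sketch carries the row vector I · D^(m−1) forward, which computes the same sums with vector-matrix products only.

```python
def count_words_up_to(edges, n_max):
    """Number of accepted words of length <= n_max for a SOA (sketch).

    `edges` is a set of pairs over the states "src", "sink" and the
    alphabet symbols. Each walk from src to sink spells exactly one
    word, so counting walks counts words.
    """
    symbols = sorted({v for e in edges for v in e} - {"src", "sink"})
    k = len(symbols)
    # Row vector I, column vector F and matrix D as defined in the text.
    I = [1 if ("src", a) in edges else 0 for a in symbols]
    F = [1 if (a, "sink") in edges else 0 for a in symbols]
    D = [[1 if (a, b) in edges else 0 for b in symbols] for a in symbols]

    total = 1 if ("src", "sink") in edges else 0   # the empty word, m = 0
    row = I                                        # row = I * D^(m-1), m = 1
    for _ in range(1, n_max + 1):
        total += sum(x * f for x, f in zip(row, F))    # I * D^(m-1) * F
        row = [sum(row[i] * D[i][j] for i in range(k)) for j in range(k)]
    return total
```

For the SOA of (a + b)+ ? (all edges among a and b, both reachable from src, both leading to sink, plus the edge from src to sink), the words of length at most 2 are ε, a, b, aa, ab, ba, and bb, so the function returns 7 for `n_max = 2`.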
|
||
|
||
5. CORRECTION
|
||
In the conference version of this article [Bex et al. 2006] we proposed a different set of rewrite and repair rules for transforming SOAs into SOREs. While
|
||
those rewrite rules were claimed in Bex et al. [2006] to possess the analog of
|
||
Proposition 19 (namely that they always produce a SORE equivalent to the
|
||
input SOA, provided that such a SORE exists), this claim is false, as we will
|
||
detail next. Readers unfamiliar with Bex et al. [2006] may freely skip this
|
||
section without endangering comprehension of the rest of the article.
|
||
To illustrate why the preceding claim is false, the rewrite rules of Bex et al.
|
||
[2006] are given in Figure 10, where G∗ refers to the ε-closure of G, defined as
|
||
follows.
|
||
Definition 25. Let G = (V, E) be a generalized SOA. The ε-closure G∗ of G
|
||
is the graph (V, E∗ ) where E∗ contains:
|
||
—all edges of E;
|
||
—all edges (r, r) with r = s+ or r = s+ ?;
|
||
—all edges (r, s) for which there is a path from r to s in G that passes only
|
||
through intermediate nodes t with ε ∈ L(t).
|
||
Fig. 10. Set of rewrite rules introduced in the conference version of this article [Bex et al. 2006].
|
||
|
||
Figure 11 shows a sequence of rewrite steps using these rules starting from
|
||
the SOA recognizing (a + b)+ ? or, equivalently, (a? . b?)+ . Note that the second
|
||
rewrite step, which introduces b?, causes the automaton to become disconnected: because a? ∈ PredG∗ (b) and sink ∈ SuccG∗ (b) − {b} it deletes (a?, sink)—
|
||
the only edge linking src to sink. As such, the accepted language changes from
|
||
L((a + b)+ ?) to ∅. This clearly illustrates that the OPTIONAL r? rule in Figure 10
|
||
is unsound. For that reason, we have moved in this article to the new rewrite
|
||
rules in Figures 4–6.
|
||
It is peculiar, however, that we have extensively used the rewrite rules of
|
||
Figure 10 together with the repair rules in Figure 13 in a prototype implementation but have never encountered a situation where:
|
||
Fig. 11. A problematic sequence of rewrite steps using the rules in Figure 10. The input SOA
|
||
accepts the same language as (a+b)+ ?, or, equivalently (a? . b?)+ . Note that the automaton resulting
|
||
from the second rewrite step is disconnected and hence accepts the empty language. Rewriting
|
||
is therefore not sound.
|
||
|
||
Fig. 12. A successful sequence of rewrite steps using the rules in Figure 10. The input SOA accepts
|
||
the same language as (a + b)+ ?, or, equivalently (a? . b?)+ .
|
||
|
||
—we obtained a SORE r that failed to accept at least all words in the input
|
||
SOA G; or
|
||
—we obtained a SORE r that accepted a strict superset of L(G) when G was
|
||
equivalent to a SORE.
|
||
We suspect that this behavior is due to the strict order in which we apply the
|
||
rewrite rules in our implementation: first CONCATENATION, then DISJUNCTION,
|
||
then SELF-LOOP, and finally OPTIONAL. To illustrate, Figure 12 shows a successful
|
||
rewriting of the SOA accepting (a + b)+ ? under this order.
|
||
The inference algorithm of Bex et al. [2006], which we shall call RWR0 in this
|
||
article, is shown in Algorithm 6. It is based on the rewrite rules in Figure 10
|
||
and the repair rules in Figure 13. The experiments in Section 7 indicate that
RWR0 has no benefits over RWR and RWR2. Moreover, as we do not have a formal
|
||
soundness and completeness proof showing that rewriting always produces a
|
||
SORE equivalent to the input SOA (provided that such a SORE exists) under
|
||
this order, it does not make much sense to consider RWR0 for the class of SOREs.
|
||
In strong contrast, on the class of k-occurrence regular expressions (k > 1), RWR0
|
||
can make a difference over RWR and RWR2 [Bex et al.]. So even without formal
|
||
guarantees, RWR0 still has its merits.
|
||
Algorithm 6. RWR0
Input: a SOA G
Output: a SORE r
1: if sink is not reachable from src in G then
2:   return ∅
3: else if E(G) = {(src, sink)} then
4:   return ε
5: else
6:   initialize done to false
7:   while not done do
8:     if a rewrite rule in Figure 10 is applicable then
9:       rewrite G, giving precedence to CONCATENATION, then DISJUNCTION, then SELF-LOOP, then OPTIONAL
10:    else if a repair rule in Figure 13 is applicable then
11:      repair G, giving precedence to ENABLE-DISJUNCTION, then ENABLE-OPTIONAL-1, then ENABLE-OPTIONAL-2
12:    else
13:      set done to true
14:    end if
15:  end while
16:  if G is final then
17:    return the corresponding regular expression r
18:  else
19:    return ∅
20:  end if
21: end if
|
||
|
||
6. INFERRING CHARES: CRX
|
||
In this section, we present the algorithm CRX for the inference of chain regular
|
||
expressions (CHAREs).
|
||
Definition 26 (CHAREs ). The class of chain regular expressions consists of
|
||
those SOREs of the form f1 · · · fn where every fi is a chain factor—an expression
|
||
of the form (a1 + · · · + ak), (a1 + · · · + ak)?, (a1 + · · · + ak)+ , or, (a1 + · · · + ak)+ ? with
|
||
k ≥ 1 and every ai is an alphabet symbol.
|
||
For instance, the expression a(b+c)+ ?d+ (e + f )? is a CHARE, while (ab+c)+ ?
|
||
and (a+ ? + b?)+ ? are not.
|
||
Since each CHARE is a concatenation of alphabet-disjoint chain factors,
|
||
every occurrence of an alphabet symbol in a word must be generated by the
|
||
same chain factor in the target CHARE. The positional relationships between
|
||
occurrences of alphabet symbols in a given sample then allow us to deduce
|
||
which chain factors are present in the target CHARE, and how they are ordered.
|
||
Example 27. Consider the sample S = {u, v, w} with u = abd, v = bcdee,
|
||
and w = cade. Clearly a occurs before b in u, b occurs before c in v, and c occurs
|
||
before a in w. In the target CHARE, therefore, a, b, and c must belong to the
|
||
same chain factor which can only be (a + b + c)+ or (a + b + c)+ ?. Since one of
|
||
{a, b, c} is present in every word of S, we choose (a + b + c)+ . Similarly, d and
|
||
e form chain factors by themselves. Whereas d occurs once in every word in S,
|
||
e can occur zero, one, or more times. Therefore, d is represented by the chain
|
||
Fig. 13. Repair rules accompanying the rewrite rules in Figure 10. These rules are a correction
|
||
of the rules presented in Bex et al. [2006]. Repairs are tried in the order shown. In particular,
|
||
ENABLE-OPTIONAL-2 is only applied if none of the other rules is applicable.
|
||
|
||
factor d, while e is represented by the chain factor e+ ?. Since a, b, c always occur
|
||
before d, which in turn always occurs before the e’s, the derived CHARE is then
|
||
(a + b + c)+ de+ ?.
|
||
So, in brief, CRX computes chain factors, orders them, and uses that order to
|
||
generate a CHARE. Of course, the order of the chain factors is not necessarily
|
||
linear. In that case, a linear order can be constructed by making the factors
|
||
optional. Some care has to be taken, however, to generate factors that are
|
||
disjunctions without repetitions.
|
||
Definition 28. Let S be a sample. We denote by → S the partial preorder on
Σ such that a → S b if, and only if, a immediately precedes b in some w ∈ S.
|
||
(I.e., ab is a 2-gram of S.) We say that a occurs before b in S if a →∗S b, where
|
||
→∗S is the reflexive and transitive closure of → S.
|
||
For instance, Figure 14 illustrates → S when S = {abccde, cccad, bf egg,
|
||
bf ehi}.
|
||
Definition 29. Define a ≈ S b if a occurs before b in S and b occurs before a.
|
||
That is, a ≈ S b if a →∗S b and b →∗S a.
|
||
Clearly, ≈ S is an equivalence relation. Let S denote the set of equivalence classes of ≈ S. In what follows, we denote such equivalence classes by, for
|
||
Fig. 14. The partial preorder → S for S = {abccde, cccad, bf egg, bf ehi}.
|
||
|
||
Fig. 15. The Hasse diagram HS of the sample S = {abccde, cccad, bf egg, bf ehi}. The corresponding
|
||
partial preorder from which HS is derived is shown in Figure 14.
|
||
|
||
example, [a1 , . . . , an]. As usual, an equivalence class of cardinality 1 is called a
|
||
singleton.
|
||
Definition 30. The Hasse diagram of S, denoted HS, is the graph over S
|
||
in which there is an edge from equivalence class [a1 , . . . , an] to class [b1 , . . . , bm]
|
||
if: (1) [a1 , . . . , an] and [b1 , . . . , bm] are distinct and (2) there exists 1 ≤ i ≤ n and
|
||
1 ≤ j ≤ m such that ai → S b j .
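Definitions 28 to 30 can be made concrete with a small Python sketch that computes the 2-gram relation, the equivalence classes of mutually reachable symbols, and the edges of the Hasse diagram. The function names are assumptions of this illustration, and words are assumed to be strings of one-character symbols.

```python
def two_grams(sample):
    """The relation ->_S: pairs (a, b) such that ab is a 2-gram of S."""
    return {(w[i], w[i + 1]) for w in sample for i in range(len(w) - 1)}


def equivalence_classes(sample):
    """Classes of symbols that occur before each other in S."""
    symbols = sorted(set("".join(sample)))
    arrows = two_grams(sample)
    # reach[a] = all b with a ->*_S b (reflexive and transitive closure).
    reach = {a: {a} for a in symbols}
    changed = True
    while changed:
        changed = False
        for a, b in arrows:
            for c in symbols:
                if a in reach[c] and not reach[b] <= reach[c]:
                    reach[c] |= reach[b]
                    changed = True
    return {frozenset(b for b in reach[a] if a in reach[b]) for a in symbols}


def hasse_diagram(sample):
    """Nodes are equivalence classes; edges link distinct classes
    that contain symbols a and b with a ->_S b."""
    classes = equivalence_classes(sample)
    cls_of = {a: c for c in classes for a in c}
    edges = {(cls_of[a], cls_of[b])
             for a, b in two_grams(sample) if cls_of[a] != cls_of[b]}
    return classes, edges
```

For S = {abccde, cccad, bfegg, bfehi} this yields the class [a, b, c] together with six singleton classes, and seven edges between them.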
|
||
For instance, the Hasse diagram of the sample S = {abccde, cccad, bf egg,
|
||
bf ehi} is shown in Figure 15. The operation of CRX is then shown in Algorithm 7
|
||
and illustrated in the following example.
|
||
Example 31. Consider again the sample S = {abccde, cccad, bf egg, bf ehi}
|
||
and its corresponding Hasse diagram in Figure 15. Since Pred HS ([d]) =
|
||
Pred HS ([ f ]) and Succ HS ([d]) = Succ HS ([ f ]), line 3 applies to [d] and [ f ]. Although
|
||
Pred HS ([g]) = Pred HS ([h]), step 2 cannot be applied as Succ HS ([g]) ≠ Succ HS ([h]).
|
||
Similarly [g] and [i] share successors, that is, ∅, but have different predecessors.
|
||
Hence, after the while loop in line 2 we obtain:
|
||
|
||
A possible topological sort is [a, b, c], [d, f ], [e], [g], [h], [i]. Since at least one of
|
||
a, b, and c occurs once or more in every string of S, r([a, b, c]) = (a + b + c)+ is
the first factor; the second factor is (d + f ) since either d or f occurs exactly
once; the factor derived from [e] is e? since S contains a string without e
|
||
and similarly for those from [h] and [i]. Finally, g occurs multiple times in a
|
||
single string. Hence the simple regular expression derived by the algorithm is
|
||
(a + b + c)+ · (d + f ) · e? · g+ ? · h? · i? which completes step 6.
|
||
Note that the order of the chain factors in the CHARE depends on the
|
||
topological sort.
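The steps of CRX illustrated above can be put together in a compact Python sketch. It is not the Java prototype used in the experiments: words are assumed to be strings of one-character symbols, ties in the topological sort are broken alphabetically (an arbitrary choice made here for reproducibility), and the output is a plain string in which factors are separated by spaces.

```python
def crx(sample):
    """Sketch of CRX for words given as strings of one-character symbols."""
    symbols = sorted(set("".join(sample)))
    arrows = {(w[i], w[i + 1]) for w in sample for i in range(len(w) - 1)}

    # Reflexive and transitive closure of the 2-gram relation.
    reach = {a: {a} for a in symbols}
    changed = True
    while changed:
        changed = False
        for a, b in arrows:
            for c in symbols:
                if a in reach[c] and not reach[b] <= reach[c]:
                    reach[c] |= reach[b]
                    changed = True

    # Line 1: equivalence classes, plus the Hasse diagram edges.
    cls = {a: frozenset(b for b in reach[a] if a in reach[b]) for a in symbols}
    nodes = set(cls.values())
    edges = {(cls[a], cls[b]) for a, b in arrows if cls[a] != cls[b]}

    # Lines 2-4: merge singletons with equal predecessors and successors.
    while True:
        groups = {}
        for n in nodes:
            if len(n) == 1:
                pred = frozenset(u for u, v in edges if v == n)
                succ = frozenset(v for u, v in edges if u == n)
                groups.setdefault((pred, succ), []).append(n)
        group = next((g for g in groups.values() if len(g) > 1), None)
        if group is None:
            break
        merged = frozenset().union(*group)
        nodes = (nodes - set(group)) | {merged}
        edges = {(merged if u in group else u, merged if v in group else v)
                 for u, v in edges}

    # Line 5: topological sort, alphabetical tie-breaking.
    order, remaining = [], set(nodes)
    while remaining:
        ready = [n for n in remaining
                 if not any(v == n and u in remaining for u, v in edges)]
        nxt = min(ready, key=sorted)
        order.append(nxt)
        remaining.remove(nxt)

    # Lines 6-15: pick the chain factor from the occurrence counts.
    factors = []
    for node in order:
        counts = [sum(w.count(a) for a in node) for w in sample]
        body = "+".join(sorted(node))
        if len(node) > 1:
            body = "(" + body + ")"
        if all(c == 1 for c in counts):
            factors.append(body)
        elif all(c <= 1 for c in counts):
            factors.append(body + "?")
        elif all(c >= 1 for c in counts):
            factors.append(body + "+")
        else:
            factors.append(body + "+?")
    return " ".join(factors)
```

On the sample of Example 31 the sketch returns `(a+b+c)+ (d+f) e? g+? h? i?`, and on the sample of Example 27 it returns `(a+b+c)+ d e+?`. A different tie-breaking rule in the topological sort may reorder incomparable factors.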
|
||
THEOREM 32. Given a sample S, CRX computes a CHARE r such that S ⊆ L(r).
|
||
|
||
Algorithm 7. CRX
Input: a sample S
Output: a CHARE r such that S ⊆ L(r)
1: Compute the set S of equivalence classes of ≈ S
2: while a maximal set of singleton nodes γ1, . . . , γℓ such that Pred HS (γ1) = · · · = Pred HS (γℓ) and Succ HS (γ1) = · · · = Succ HS (γℓ) exists do
3:   Replace γ1, . . . , γℓ by γ := γ1 ∪ · · · ∪ γℓ, and redirect all incoming and outgoing edges of the γi to γ in HS
4: end while
5: Compute a topological sort γ1, . . . , γk of the nodes
6: for all i ∈ {1, . . . , k} (γi = [a1, . . . , an]) do
7:   if every w ∈ S contains exactly one occurrence of a symbol in {a1, . . . , an} then
8:     r(γi) := (a1 + · · · + an)
9:   else if every w ∈ S contains at most one occurrence of a symbol in {a1, . . . , an} then
10:    r(γi) := (a1 + · · · + an)?
11:  else if every w ∈ S contains at least one of a1, . . . , an and there is a word that contains at least two occurrences of symbols then
12:    r(γi) := (a1 + · · · + an)+
13:  else
14:    r(γi) := (a1 + · · · + an)+ ?
15:  end if
16: end for
17: return r(γ1) . r(γ2) . · · · . r(γk)
|
||
|
||
PROOF. The theorem follows almost immediately from the construction.
|
||
Clearly, CRX always outputs a CHARE. Moreover, observe that after step 5
|
||
the computed topological sort is consistent with the order of the symbols in the
|
||
words in S. More precisely, there can not exist symbols a and b, such that a ∈ γi ,
|
||
b ∈ γ j , i < j, and b →∗S a. Subsequently, for each γi a chain factor is chosen
|
||
in such a manner that it is consistent with all words w ∈ S. As these factors
|
||
are ordered consistently with the order of the symbols in S, this implies that
|
||
S ⊆ L(r).
|
||
Furthermore, on the class of CHAREs, CRX is complete.
|
||
THEOREM 33. For each CHARE r there is a sample S such that L(r) = L(CRX(S)).
|
||
|
||
PROOF. Denote by Sym(r) the set of alphabet symbols occurring in r. We also
|
||
abuse notation and, for a sample S, write Sym(S) to denote the set of alphabet
|
||
symbols occurring in S. Let r = f1 · · · fk be a CHARE, with each fi a chain
|
||
factor. We construct the sample S such that the CRX(S) is syntactically equal to
|
||
r, up to commutativity of +. The theorem then follows.
|
||
Thereto, for every 1 ≤ i ≤ k, let wi be a word in L( fi ). We construct S by
|
||
subsequently adding words to it. First, for all 1 ≤ i ≤ k − 1, a ∈ Sym( fi ),
|
||
b ∈ Sym( fi+1 ), we add w1 · · · wi−1 abwi+2 · · · wk to S. Further, for all 1 ≤ i ≤ k,
|
||
we add words to S, depending on the form of fi . Specifically, if fi is of the
|
||
form:
|
||
—(a1 + · · · + an), we add w1 · · · wi−1 a1 wi+1 · · · wk;
|
||
—(a1 + · · · + an)?, we add w1 · · · wi−1 wi+1 · · · wk, and w1 · · · wi−1 a1 wi+1 · · · wk;
|
||
—(a1 + · · · + an)+ , we add w1 · · · wi−1 a1 a1 wi+1 · · · wk;
|
||
—(a1 + · · · + an)+ ?, we add w1 · · · wi−1 wi+1 · · · wk, and w1 · · · wi−1 a1 a1 wi+1 · · · wk.
|
||
We now argue that given S, CRX indeed derives an expression syntactically
|
||
equal to r. First observe that already before step 3, CRX computes k nodes γ1 to
|
||
γk, which are linearly ordered, such that for each 1 ≤ i ≤ k, γi contains exactly
|
||
the alphabet symbols contained in fi . Then, due to the number of occurrences
|
||
of each symbol of the different chain factors, the algorithm will associate to
|
||
each γi exactly the factor fi , and hence CRX(S) is syntactically equivalent to r,
|
||
up to commutativity of +.
|
||
From Theorems 32 and 33 it readily follows that we have the next corollary.
|
||
COROLLARY 34.
|
||
|
||
CRX learns the class of CHAREs from positive data.
|
||
|
||
The experiments in Section 7.3 show that the number of words in S needed
|
||
in practice is very small. Actually, the prime feature that makes CRX much
|
||
more robust than RWR for very small datasets is its strong generalization ability. Indeed, consider an expression of the form (a1 + · · · + an)+ ?. While REWRITE
|
||
requires all n^2 2-grams of the form ai aj for i, j ∈ {1, . . . , n} to be present, RWR
requires around (n^2 − n) 2-grams. For CRX, however, the set {ε, a1 a2, a2 a3, . . . ,
an−1 an, an a1} of size O(n) will suffice. This point is illustrated in practice
by example3 and example4 in Table II where n has a value of 41 and 56,
respectively. Experiments illustrate that only 400 ≪ 1682 and 500 ≪ 3136
|
||
2-grams are needed by CRX to learn example3 and example4, respectively.
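The generalization argument can be checked with a short Python sketch: the cyclic sample contains only n distinct 2-grams, yet every symbol reaches every other symbol through them, so all symbols fall into a single equivalence class. The helper names and the tuple-of-symbols word representation are assumptions of this illustration.

```python
def cyclic_sample(n):
    """The sample {eps, a1 a2, a2 a3, ..., a(n-1) an, an a1} as tuples."""
    syms = [f"a{i}" for i in range(1, n + 1)]
    words = [()] + [(syms[i], syms[(i + 1) % n]) for i in range(n)]
    return syms, words


def distinct_two_grams(words):
    """Set of all 2-grams occurring in the given words."""
    return {g for w in words for g in zip(w, w[1:])}


def single_class(syms, words):
    """True if every symbol reaches every symbol via 2-grams."""
    succ = {a: set() for a in syms}
    for a, b in distinct_two_grams(words):
        succ[a].add(b)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            for b in succ[stack.pop()]:
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    return all(reachable(a) == set(syms) for a in syms)
```

For n = 41 the sample has 41 distinct 2-grams, against 41^2 = 1681 possible ones, and `single_class` holds. Since the empty word contains no symbol of the class and every other word contains two, the factor selected for that class is (a1 + · · · + an)+ ?.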
|
||
The following theorem shows that CRX is optimal within the class of CHAREs
|
||
when the partial order S is in fact a linear order.
|
||
THEOREM 35. For every sample S, if S is a linear order then for every
|
||
CHARE r such that S ⊆ L(r) and L(r) ⊆ L(CRX(S)), we have r = CRX(S), that is, r
|
||
is syntactically equal to CRX(S) up to commutativity of +.
|
||
PROOF. Assume that CRX(S) = f1 · · · fk and r = g1 · · · gl . Clearly,
|
||
Sym(CRX(S)) = Sym(r) = Sym(S). We first argue that k = l. Thereto, assume
|
||
for the purpose of contradiction that k < l. Then, there is a chain factor f in
|
||
CRX(S) with a, b ∈ Sym(f) and two chain factors g and g′ in r with a ∈ Sym(g)
and b ∈ Sym(g′). We distinguish two cases.
(1) If f is of the form (a1 + · · · + an) or (a1 + · · · + an)?, then L(r) ⊈ L(CRX(S)).
|
||
(2) If f is of the form (a1 + · · · + an)+ ? or (a1 + · · · + an)+ , by construction and
|
||
since S is linearly ordered, there are words u1 , u2 ∈ S such that a →∗u1 b
|
||
and b →∗u2 a. However, since a and b are in different chain factors of r,
either u1 ∉ L(r) or u2 ∉ L(r), and hence S ⊈ L(r).
|
||
Conversely, assume k > l. Then, there are chain factors f, f′ in CRX(S) with
a ∈ Sym(f) and b ∈ Sym(f′), and a chain factor g in r with a, b ∈ Sym(g). We
again distinguish two cases.
(1) If g is of the form (a1 + · · · + an)+ ? or (a1 + · · · + an)+ , then L(r) ⊈ L(CRX(S)).
|
||
(2) If g is of the form (a1 + · · · + an) or (a1 + · · · + an)?, by construction and since S
is linearly ordered, there are words u1, . . . , um ∈ S, and symbols c1, . . . , cm−1
such that a →∗u1 c1, cm →∗um b, and ci →ui+1 ci+1, for all 1 ≤ i ≤ m − 1.
However, due to the form of g, for at least one of these ui, ui ∉ L(r) must
hold and hence S ⊈ L(r).
|
||
Using the same kind of argument it can be shown that Sym( fi ) = Sym(gi ),
|
||
for all 1 ≤ i ≤ k. Further, since L(r) ⊆ L(CRX(S)), for every 1 ≤ i ≤ k, we
|
||
have L(gi ) ⊆ L( fi ). Since the different chain factors can only take a restricted
|
||
numbers of forms, it now suffices to show that L(gi ) = L( fi ), for all i, to show that
|
||
they are also syntactically equivalent. Hence, towards a contradiction, assume
|
||
L(gi) ⊊ L(fi) for some 1 ≤ i ≤ k. This can only be the case if: (1) gi = (a1 +· · ·+an)
|
||
and fi = (a1 + · · · + an); (2) gi = (a1 + · · · + an)+ ? and fi = (a1 + · · · + an)+ ; or
|
||
(3) gi = (a1 + · · · an)? and fi is one of the three other forms. However, in each of
|
||
these cases, given the construction of the algorithm, one can find a word w ∈ S
such that w ∉ L(r). Hence, for all i, L(fi) = L(gi), and thus r = CRX(S).
|
||
Note that this property does not hold when S is not linear. For instance, on
|
||
S = {abc, ade, abe} CRX yields a·b?·d?·c?·e? whereas the CHARE a·(b+d)·(c +e)
|
||
is a better approximation of the target language.
|
||
CRX can be efficiently executed on very large datasets by only maintaining
HS and the multiplicities of occurrences of Σ-symbols in words in S (needed for
lines 6–13). From this representation, lines 2–5 can be executed. Hence, it is
not necessary that the entire sample resides in main memory. The complexity
of the algorithm is O(m + n^3), where m is the size of the sample and n the
number of alphabet symbols.
|
||
7. EXPERIMENTAL EVALUATION
|
||
In this section we validate our approach by means of experimental analysis.
|
||
Specifically, we assess the quality of the expressions returned by our algorithms
|
||
on real-world corpora and DTDs, and compare it with the quality of expressions
|
||
returned by XTRACT [Garofalakis et al. 2003] and Trang [Clark]. Next, we compare the quality of RWR0 (the algorithm found in the conference version of this
|
||
article), RWR, and RWR2 . Subsequently, we investigate the performance of the algorithms on incomplete and noisy data. Finally, we discuss their running time
|
||
performance. We abuse notation and simply write RWR for the application of
|
||
2T-INF followed by RWR, similarly for RWR0 and RWR2 . All experiments were performed using a prototype implementation of our algorithms in Java executed
|
||
on a 2.5 GHz Pentium 4 machine with 1GB of RAM.
|
||
7.1 Real-World Examples
|
||
The number of publicly available XML corpora is rather limited. We employed
|
||
the XML Data repository maintained by Miklau [2002] as a testbed. Unfortunately, most of the corpora listed there are either very small, lack a DTD,
|
||
or contain a DTD with only trivial regular expressions. Nevertheless, two of
|
||
the listed corpora are interesting. Specifically, we compared XTRACT, RWR, and
|
||
CRX on the Protein Sequence Database (683Mb in size) and the Mondial corpus
|
||
[Miklau 2002], a database of information on various countries (1Mb in size).
|
||
Table I. Results of RWR, CRX and XTRACT on DTDs and Sample Data from
the Protein Description Database and the Mondial Corpora

Element /                 Original DTD /
sample size CRX/RWR /     result of CRX/RWR /
sample size XTRACT        result of XTRACT

ProteinE.                 a1 a2 a3 a4+? a5+? a6+? a7+? a8+? a9? a10? a11+? a12 a13
  2458                    a1 a2 a3 a4+ a5+? a6+? a7+? a8+? a9? a10? a11+? a12 a13
  843                     an expression of 185 tokens

organism                  a1 a2? a3 a4? a5+?
  9                       a1 a2? a3 a4? a5+?
  9                       a1 ((a2 a3 a4? + a3 a4) a5? + a3 a5+?)

reference                 a1 a2+? a3+? a4+?
  45                      a1 a2+? a3+? a4+?
  45                      a1 (a2+? (a4+? + a3+?) + a2 a3+? a4 a4 + a3+? a4+?)

refinfo                   a1 a2 a3? a4? a5 a6? (a7 + a8)? a9?
  10                      a1 a2 (a3 + a4)? a5 a6? a7? a9? a8?
  10                      a1 a2 ((a3 a5 a6 a7? + a4 a5) a9? + a5 (a7 + a8)? + a4 a5 a8)

authors                   a1+ + (a2 a3?)
  54                      a1+? a2? a3? / a1+ + (a2 a3)
  54                      a1+? + a2 a3

accinfo                   a1 a2+? a3+? a4? a5? a6? a7+?
  124                     a1 a2+? a3+ a4? a5? a6? a7+?
  124                     an expression of 97 tokens

genetics                  a1+? a2? a3? a4? a5? a6? a7? a8? a9? a10? a11+? a12+?
  219                     a1+? a2? a3? a4? a5? a6? a7? a8? a9? a10? a12+?
  219                     an expression of 329 tokens

function                  a1? a2+? a3+?
  26                      a1? a2+? a3+?
  26                      (a1 (a2? a2? a3+? + a2+? (a3 a3)+? + a2 a2 a2 a3) + a2 (a2 a3+? + a3+?))

city                      a1 a2+? a3+?
  9                       a1 a2+? a3+?
  9                       a1 (a2+? a3 a3? + a2 (a3+? + a2))?

The left column gives element names, sample size for CRX/RWR, and sample size for
XTRACT, respectively. The right column lists original DTD, inferred DTD by CRX/RWR,
and the result of XTRACT, in that order. An operator attached directly to its operand
(as in a4+ or a4+?) is the iteration superscript; a + surrounded by spaces is disjunction.
|
||
|
||
Since no real-world data could be obtained for SOREs that are not CHAREs,
|
||
we generated our own XML data for a number of real-world DTDs considered
|
||
in Bex et al. [2004] containing a number of sophisticated regular expressions
|
||
outside the class of CHAREs.
|
||
Real-world data. In this section, we only discuss RWR as RWR0 and RWR2 give
|
||
precisely the same results. Table I lists all nontrivial element definitions2 in
|
||
the aforementioned DTDs together with the results derived by the inference
|
||
algorithms RWR, CRX, and XTRACT. It is interesting to note that only the regular
|
||
expression for authors is not a CHARE. Moreover, no elements are repeated
|
||
in any of the definitions. This should not come as a surprise given the observations discussed in the Introduction on the content models occurring in practice.
|
||
The regular expression derived by the XTRACT algorithm is shown whenever
|
||
it fitted the table, otherwise the number of tokens it consists of is listed. For
|
||
better readability the actual output of XTRACT has been simplified by replacing
|
||
expressions such as (ai + ε) by ai ?.
|
||
2 It should be noted that the examples from the Mondial corpus are not valid according to their
|
||
DTD, so for the city element only valid elements were used as training examples.
|
||
|
||
ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.
|
||
It can be verified that all regular expressions in Table I are learned quite
|
||
satisfactorily by RWR and CRX with respect to the examples extracted from the
|
||
XML corpus. The numbers in the first column refer to the size of the sample.
|
||
RWR and CRX always produce the same result except for authors where CRX
|
||
cannot derive the target expression as it is not a CHARE. We note that no
|
||
sample was representative of its target expression. As such, RWR always had to
|
||
apply repair rules. The expressions in the table indicate that the results of these
|
||
repairs are satisfactory. For a few expressions, for instance, ProteinE(ntry),
|
||
refinfo, and genetics, the expressions produced by CRX and RWR are more
|
||
strict than the corresponding one in the DTD. This is due to the data present
|
||
in the sample. For instance, for genetics, no a11 element occurs in the sample
|
||
so it obviously cannot be part of the derived expression. The element refinfo
|
||
illustrates that a3 and a4 are mutually exclusive in the sample and that a8 is
|
||
never followed by a9 . Inspecting the original DTD illustrates the underlying
|
||
semantics.
|
||
authors, citation, volume?, month?, year,
|
||
pages?, (title | description)?, xrefs?
|
||
Indeed, volume is used in the context of a journal, while month is used for a
|
||
conference publication. Apart from the authors element XTRACT either produces
|
||
a suboptimal expression or no expression at all. For instance, XTRACT crashes on
|
||
the ProteinE(ntry) sample due to excessive memory consumption (more than
|
||
1GB of RAM). Reducing the size of the sample to approximately 800 unique
|
||
words yields a complex expression of 185 tokens.
|
||
Real-world regular expressions. Table II lists the results of the algorithms on
|
||
a number of more sophisticated regular expressions extracted from real-world
|
||
DTDs discussed in Bex et al. [2004]. Since no real-world data was available
|
||
for those DTDs, we have randomly generated samples using ToXgene [Barbosa
|
||
et al. 2002], taking care that all relevant examples were present to ensure
|
||
the target expression could be learned. Again, we list the sample size in the
|
||
first column. As some of these numbers might seem artificially large, we note
|
||
that, for instance, the SOA corresponding to example3 already contains 1897
|
||
edges. Hence, a random dataset of 5741 words is not unreasonably large. Note
|
||
that only the first three expressions in Table II are SOREs, none of them
|
||
is a CHARE. The table shows clearly that CRX yields fairly good and concise
|
||
super-approximations to the original expressions. In some cases, the results
|
||
produced by RWR are more precise. For XTRACT, the size of the sample had to be
|
||
limited to 300–500 in order to avoid a crash. As can be seen from the table,
|
||
XTRACT performed excellently on the first example, but failed to generate an
|
||
expression that fitted the table in all other cases on all the sample sets we
|
||
tried.
|
||
Trang. We ran Trang [Clark] on the XML data discussed in this section.
|
||
In all but one case, Trang produced exactly the same output as CRX, with a
|
||
notable exception: for example1 Trang’s output depends on the order in which
|
||
the examples are presented, yielding either a1^+? a2? a3^+? or a1^+ + (a2? a3^+). The
|
||
Table II. Results of RWR, CRX and XTRACT on Nonsimple Real-World DTDs and Generated Data

Element   Sample size  Source        Expression
example1               Original DTD  a1^+ + (a2? a3^+)
          48           CRX           a1^+? a2? a3^+?
          48           RWR           a1^+ + (a2? a3^+)
          48           XTRACT        a1^+? + (a2? a3^+?)
example2               Original DTD  (a1 a2? a3?)? a4? (a5 + ··· + a18)^+?
          2210         CRX           a1? a2? a3? a4? (a5 + ··· + a18)^+?
          2210         RWR           (a1 a2? a3?)? a4? (a5 + ··· + a18)^+?
          300          XTRACT        an expression of 252 tokens
example3               Original DTD  a1? (a2 a3?)? (a4 + ··· + a44)^+? a45^+
          5741         CRX           a1? a2? a3? (a4 + ··· + a44)^+? a45^+
          5741         RWR           a1? (a2 a3?)? (a4 + ··· + a44)^+? a45^+
          400          XTRACT        an expression of 142 tokens
example4               Original DTD  a1? a2 a3? a4? (a5^+ + ((a6 + ··· + a61)^+ a5^+?))
          10000        CRX           a1? a2 a3? a4? (a6 + ··· + a61)^+? a5^+?
          10000        RWR           a1? a2 a3? a4? (a6 + ··· + a61)^+? a5^+?
          500          XTRACT        an expression of 185 tokens
example5               Original DTD  a1 (a2 + a3)^+? (a4 (a2 + a3 + a5)^+?)^+?
          1281         CRX           a1 (a2 + a3 + a4 + a5)^+?
          1281         RWR           a1 ((a2 + a3 + a4)^+ a5^+?)^+?
          500          XTRACT        an expression of 85 tokens

For each element, the rows give the original DTD followed by the expressions
inferred by CRX, by RWR and by XTRACT, with the sample size used for each.
A superscript + (one or more repetitions) is written ^+; an infix + denotes
disjunction.
|
||
|
||
former is the same output as CRX, the latter is the intended RE that cannot
|
||
be derived by CRX as it is outside the class of CHAREs. This inconsistency in
|
||
Trang’s output casts some doubt on its correctness and underscores the need
|
||
for a formal model as the cornerstone of an implementation. Indeed, there is no
|
||
article or manual available describing the machinery underlying Trang. A look
|
||
at the Java-code indicates that Trang is related to, but different from, CRX: it
|
||
uses 2T-INF to construct an automaton, eliminates cycles by merging all nodes
|
||
in the same strongly connected component, and then transforms the obtained
|
||
DAG into a regular expression. However, no target class of REs for which Trang
|
||
is complete, as is the case for CRX, is specified. As Trang is similar to CRX, it is
|
||
outperformed by RWR and RWR².
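The cycle-elimination step attributed to Trang above (merging all nodes in the same strongly connected component so that a DAG remains) can be illustrated with a small sketch. This is not Trang's code: the function names and the edge-set representation are assumptions made for illustration, and a simple reachability test replaces the linear-time SCC algorithms a real implementation would probably use.

```python
def reachable(graph, start):
    """All nodes reachable from start by one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def merge_cycles(edges):
    """Collapse each strongly connected component into a single node.

    Returns (components, dag_edges), where every component is a frozenset
    of original nodes and dag_edges connects distinct components.
    """
    graph, nodes = {}, set()
    for s, t in edges:
        graph.setdefault(s, set()).add(t)
        nodes.update((s, t))
    reach = {n: reachable(graph, n) for n in nodes}
    # Two nodes share a component iff each is reachable from the other.
    component = {
        n: frozenset(m for m in nodes if m == n or (m in reach[n] and n in reach[m]))
        for n in nodes
    }
    dag = {(component[s], component[t]) for s, t in edges if component[s] != component[t]}
    return set(component.values()), dag


# a and b lie on a cycle, so they are merged; c stays on its own.
components, dag = merge_cycles({("a", "b"), ("b", "a"), ("b", "c")})
print(components, dag)
```

The quadratic reachability computation is acceptable here only because the number of element names per content model is small.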
|
||
7.2 RWR versus RWR2
|
||
We tested the results and performance of RWR versus RWR² for various values
of the rank cut-off parameter ℓ. The SOAs used in this test were randomly
generated with 5 and 10 alphabet symbols. The results are summarized in
Table III(a). We computed the average language size of the SOAs, which is the
target size. It should be noted that, since no SORE corresponds to these SOAs,
the target size can never be attained: the regular expression resulting
from RWR or RWR² will necessarily be a generalization of the SOA’s language.
It is immediately clear from Table III(a) that the results of RWR² are on average
better than those of RWR, and that they improve with increasing values of ℓ.
|
||
Table III.

(a)
Algorithm      |Σ| = 5   |Σ| = 10
target size    0.52      0.67
RWR0           0.88      0.98
RWR            0.80      0.96
RWR², ℓ = 1    0.76      0.95
RWR², ℓ = 2    0.73      0.92
RWR², ℓ = 3    0.725     0.916
RWR², ℓ = 4    0.722     0.911
RWR², ℓ = 5    0.721     0.908
RWR², ℓ = ∞    0.720     N/A

(b)
RWR², ℓ        |Σ| = 5   |Σ| = 10
1              28.8%     46.3%
2              7.6%      7.3%
3              3.2%      1.2%
4              1.3%      0.0%
5              0.7%      0.0%
∞              24.6%     N/A

(a) Average language size for RWR and RWR² for various values of ℓ.
ℓ = ∞ denotes an exhaustive exploration of all possible repairs.
(b) Percentage of target expressions for which RWR outperforms RWR².
|
||
|
||
For expressions of alphabet size 5, we were able to consider all possible repairs,
|
||
resulting in the entry for ℓ = ∞ in Table III(a). This represents the smallest
|
||
language that includes the SOA’s language and that can be expressed by a
|
||
SORE.
|
||
Of course, the results in Table III(a) are averaged over 1000 randomly chosen
SOAs. A more detailed analysis reveals that for a considerable number of SOAs,
RWR actually outperforms RWR² for ℓ = 1. Table III(b) shows how often
RWR outperforms RWR² for various values of ℓ. The probability that RWR
outperforms RWR² drops rapidly for increasing values of ℓ, especially for larger
alphabet sizes. The last line in Table III(b) lists the probability that RWR derives
the optimal result, that is, that the smallest language representable by a SORE
is obtained for expressions of alphabet size 5.
|
||
Although the RWR2 algorithm clearly outperforms RWR in terms of the language size of the derived expression, there is a compelling argument in the
|
||
latter’s favor. In terms of running time, RWR outperforms RWR² by a few orders of magnitude, as discussed in Section 7.5.
|
||
7.3 Incomplete Data
|
||
Unfortunately, in a real-world setting an available sample may simply contain
|
||
too little information to learn the target regular expression. To formalize this,
|
||
we introduce the notion of coverage.
|
||
Definition 36. A sample S covers a deterministic automaton A if for every
edge (s, t) in A there is a word w ∈ S whose unique accepting run in A
traverses (s, t). Such a word w is called a witness for (s, t). A sample S covers a
deterministic regular expression r if it covers the automaton obtained from r
using the Glushkov construction for translating regular expressions into
automata [Brüggemann-Klein 1993].
|
||
If a sample S does not contain a witness for an edge, it may seem as if
|
||
the target expression cannot be learned, even if it is a SORE since the SOA
|
||
derived from the data has an edge missing. However, the repair rules introduce
|
||
extra edges, so this part of the algorithm may actually alleviate the problem of
|
||
Table IV. Percentage of Successfully Derived Expressions at Various Values of
Sample Coverage for CRX, RWR0, RWR and RWR²₁

coverage   CRX    RWR0   RWR    RWR²₁
25.0       85%    56%    12%    73%
35.0       87%    48%    32%    73%
45.0       96%    60%    57%    74%
55.0       87%    58%    63%    57%
65.0       82%    48%    58%    59%
75.0       80%    51%    51%    63%
85.0       63%    48%    47%    53%
92.5       57%    48%    47%    61%
97.5       85%    74%    64%    73%
100.0      100%   100%   100%   100%
|
||
|
||
incomplete data. This is indeed confirmed experimentally. It turns out that even
|
||
with a substantial fraction of missing witnesses, the target regular expression
|
||
can be learned with an astonishing degree of success. To quantify the missing
|
||
information, we introduce the following definition:
|
||
Definition 37. The coverage of a sample with respect to a target expression
r is the ratio of the number of edges of the SOA derived from the sample to
the number of edges of the SOA representing the target expression r.
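Definition 37 can be made concrete with a short sketch. It assumes that words are sequences of element names and that an SOA is represented only by its edge set, built in the 2T-INF manner from consecutive symbol pairs. The state names `<src>` and `<sink>` and both function names are hypothetical and not taken from the article.

```python
SRC, SINK = "<src>", "<sink>"  # hypothetical names for the initial and final states


def soa_edges(sample):
    """Edge set of the single occurrence automaton derived from a sample."""
    edges = set()
    for word in sample:
        prev = SRC
        for symbol in word:
            edges.add((prev, symbol))
            prev = symbol
        edges.add((prev, SINK))
    return edges


def coverage(sample, target_edges):
    """Ratio of the sample's SOA edges to the target SOA's edges.

    Assumes every word of the sample is accepted by the target, so that the
    sample's edges form a subset of target_edges.
    """
    return len(soa_edges(sample)) / len(target_edges)


# Target a b? c?, described here by a sample that witnesses every edge.
target = soa_edges([["a"], ["a", "b"], ["a", "c"], ["a", "b", "c"]])
partial = [["a"], ["a", "b"]]
print(coverage(partial, target))  # 4 of the 7 target edges are witnessed
```

In this example the partial sample never exhibits c, so the edges into and out of c have no witness and the coverage is 4/7.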
|
||
The tests were done on 100 real-world regular expressions of alphabet sizes
up to 10, for 10 independently selected samples of varying coverage. The
results are presented in Table IV. The straightforward CRX clearly outperforms all
other algorithms, although this result should be approached with some caution:
to give CRX a fair chance, the target expressions for this algorithm were limited
to CHAREs, while the other algorithms were tested on general SOREs as well.
Note that approximately 90% of real-world expressions are in fact CHAREs,
hence its superior performance is not only due to simpler target expressions.
The robustness of RWR²₁ (RWR² with ℓ = 1) is quite remarkable since it tends to
derive more specific regular expressions than RWR0 and RWR. One would expect
the generalization ability to decrease for algorithms that yield more specific
results. This expectation is borne out when one compares RWR0 and RWR.
However, the greedy application of the repair rules by RWR²₁ seems to pay off in
the context of incomplete data as well.
|
||
7.4 Noise
|
||
As already noted in the Introduction, real-world samples (such as XHTML)
|
||
need not be valid with respect to their known schema. Errors crop up due to
|
||
all sorts of circumstances. This underscores the need for a robust inference
|
||
algorithm that can handle some noise in the input sample.
|
||
Noise can come in several forms. To generate a noisy subsample, we modify
|
||
the target expression either by replacing a symbol by a different one from the
|
||
target’s expression, or by replacing it by a symbol that is not in the alphabet of
|
||
the target expression. We then use the modified target expression to generate
|
||
a complete sample. We define the noise level as follows.
|
||
Definition 38. Given a target expression r, the noise level of a sample S is
|
||
the ratio |S− L(r)|/|S|.
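As a small illustration of Definition 38, the following sketch computes the noise level of a sample. It assumes single-character element names and uses a Python regular expression as a stand-in for the target expression r; the function name is hypothetical.

```python
import re


def noise_level(sample, target_regex):
    """Fraction of sample words that the target expression rejects."""
    pattern = re.compile(target_regex)
    invalid = [w for w in sample if pattern.fullmatch(w) is None]
    return len(invalid) / len(sample)


# Target a b*: the word "ba" is noise, the other three words are valid.
sample = ["ab", "abb", "a", "ba"]
print(noise_level(sample, r"ab*"))
```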
|
||
Here we propose an approach to filter the sample S based on the probability
|
||
of its words being generated by a probabilistic automaton, as we already used
|
||
in previous work [Bex et al. 2008]. This probabilistic automaton has one state
|
||
for each alphabet symbol, and the transition probabilities are computed using
|
||
the Baum-Welch algorithm [Rabiner 1989]. Given the probabilistic automaton,
|
||
it is straightforward to compute the probability for each w ∈ S, so that one can
|
||
rank the sample’s words. One expects words that contain noise, that is, that
|
||
would be rejected by the target regular expression, to have low probability if
|
||
their number is not excessively large compared to the sample’s size.
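The ranking step can be sketched as follows. Note the substitution: instead of Baum-Welch, this sketch estimates the transition probabilities by relative frequencies of consecutive symbol pairs, which is a simplification chosen for brevity and not the procedure used in the article. The state names `<src>` and `<sink>` and the function names are assumptions.

```python
from collections import Counter, defaultdict

SRC, SINK = "<src>", "<sink>"  # hypothetical initial and final states


def estimate_transitions(sample):
    """Relative-frequency estimate of transition probabilities (one state per symbol)."""
    counts = defaultdict(Counter)
    for word in sample:
        path = [SRC, *word, SINK]
        for s, t in zip(path, path[1:]):
            counts[s][t] += 1
    return {
        s: {t: n / sum(succ.values()) for t, n in succ.items()}
        for s, succ in counts.items()
    }


def word_probability(word, transitions):
    """Probability of a word as the product of the transitions along its path."""
    p = 1.0
    path = [SRC, *word, SINK]
    for s, t in zip(path, path[1:]):
        p *= transitions.get(s, {}).get(t, 0.0)
    return p


sample = ["ab", "ab", "ac", "ab"]
transitions = estimate_transitions(sample)
# Rank distinct words from least to most probable; rare words come first.
ranked = sorted(set(sample), key=lambda w: word_probability(w, transitions))
print(ranked)
```

Words at the front of the ranking are the candidates for removal in the filtering step.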
|
||
To filter the sample, hoping to exclude those words that contain noise, we
|
||
compute the mean μ and standard deviation σ of the sample’s probabilities. A
|
||
string w ∈ S with probability P(w) is excluded if P(w) < μ − ασ . The factor α
|
||
is a parameter of the algorithm. The filtered sample S is now used to derive
|
||
a regular expression. It is of course possible that in the generation of S some
|
||
words needed to derive the target expression were removed. Hence there is no
|
||
guarantee that the derived regular expression will be an overapproximation of
|
||
the target expression.
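The filtering rule P(w) < μ − ασ can be sketched in a few lines. The probabilities are taken as given here, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant. The function name is hypothetical.

```python
from statistics import fmean, pstdev


def filter_sample(probabilities, alpha):
    """Keep the words whose probability is at least mean - alpha * stddev."""
    values = list(probabilities.values())
    mu, sigma = fmean(values), pstdev(values)  # population std. dev. (assumption)
    threshold = mu - alpha * sigma
    return [w for w, p in probabilities.items() if p >= threshold]


# One word is far less probable than the rest and falls below the threshold.
probs = {"ab": 0.2, "abc": 0.2, "ac": 0.2, "a": 0.2, "ba": 0.01}
kept = filter_sample(probs, alpha=1.0)
print(kept)
```

A larger α lowers the threshold and therefore removes fewer words; with α large enough nothing is filtered at all.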
|
||
Since it was shown in previous sections that RWR²₁ has the best overall
performance, we focus solely on this algorithm in this section. In order to investigate
how robust RWR²₁ is with respect to noise, we applied the algorithm to samples S
with increasing noise levels and a range of values for the cut-off α. We compute
the precision and the recall for each individual expression and use the average
values over many expressions to compute the F-value for a given noise level
and cut-off so that the optimal cut-off point can be determined.
|
||
To define precision and recall, consider the sample S = Svalid ∪ Sinvalid , where
|
||
Svalid ⊆ S contains the words in S accepted by the target expression and Sinvalid
|
||
contains the words in S not accepted by the target expression. A true positive is
|
||
a word in Svalid that is accepted by the derived expression, while a false negative
|
||
is a word in Svalid that is rejected by the derived expression. Similarly, a false
|
||
positive is a word in Sinvalid that is accepted by the derived expression, while a
|
||
true negative is a word in Sinvalid that is rejected by the derived expression. We
|
||
denote by St.p. the set of true positives, by St.n. the set of true negatives, by Sf .p.
|
||
the set of false positives, and by Sf .n. the set of false negatives.
|
||
Definition 39. The precision p, recall r, and F-value of a derived regular
expression on a sample S are given by

\[
p = \frac{|S_{\mathrm{t.p.}}|}{|S_{\mathrm{t.p.}}| + |S_{\mathrm{f.p.}}|},
\qquad
r = \frac{|S_{\mathrm{t.p.}}|}{|S_{\mathrm{t.p.}}| + |S_{\mathrm{f.n.}}|},
\qquad
F = \frac{2pr}{p + r}.
\]
|
||
|
||
Furthermore, we are interested in the fraction of derived regular expressions
|
||
that is equivalent to the target expression.
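The quantities of Definition 39 can be computed directly from a sample once membership in the target and derived languages can be tested. The sketch below assumes single-character element names and uses Python regular expressions as stand-ins for both expressions; returning 0.0 for an empty denominator is a convention chosen here, not one stated in the article.

```python
import re


def evaluate(sample, target_regex, derived_regex):
    """Return (precision, recall, F-value) of a derived expression on a sample."""
    target, derived = re.compile(target_regex), re.compile(derived_regex)
    tp = fp = fn = 0
    for word in sample:
        valid = target.fullmatch(word) is not None      # word is in S_valid
        accepted = derived.fullmatch(word) is not None  # derived expression accepts it
        if valid and accepted:
            tp += 1
        elif valid:
            fn += 1
        elif accepted:
            fp += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# The derived expression accepts the noisy word "b", which costs precision.
p, r, f = evaluate(["a", "ab", "abc", "ac", "b"], r"ab?c*", r"a?b?c?")
print(p, r, f)
```

Here all four valid words are accepted and the single invalid word is accepted as well, so recall is 1 while precision drops to 4/5.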
|
||
We average over 580 SOREs obtained from a corpus of real-world DTDs.
|
||
The results are shown in Figure 16(a). From the F-value we can conclude
|
||
that a cut-off value α F ≈ 0.7 yields the best balance between precision and
|
||
recall. Figure 16(b) shows the fraction of derived regular expressions that is
|
||
Fig. 16. (a) F-value as a function of the cut-off value α for noise levels of 0.01 (squares), 0.02
|
||
(circles), and 0.05 (triangles). (b) Fraction of derived expressions equivalent to the target expression
|
||
as a function of the cut-off value α for noise levels of 0.01 (squares), 0.02 (circles), and 0.05
|
||
(triangles).
|
||
|
||
equivalent to the target expression. For noise levels increasing from 0.01 to
|
||
0.05, the F-value as well as the percentage of derived expressions equivalent
|
||
to the target expression gradually decreases, as is to be expected. It should be
|
||
noted that recall r < 1 implies that the language represented by the derived
|
||
regular expression is not a superset of the target’s language. For the cut-off α F ,
|
||
and a noise level of 0.01, approximately 16% of the derived regular expressions
|
||
allow false negatives, while the value for a noise level of 0.05 is 15%. The fact
|
||
that the derived expression is not a super-approximation may or may not be
|
||
acceptable, depending on the application.
|
||
Another interesting observation is that the number of derived expressions
|
||
that is equivalent to the target expression increases beyond the cut-off value
|
||
α F ; see Figure 16(b). For a noise level of 0.01, this trend continues up to
|
||
cut-off values of αequiv. ≈ 0.3 where it reaches a maximum of approximately
|
||
53%. However, at this value 20% of the derived regular expressions are not
|
||
super-approximations to their target expressions. For α < αequiv. , the F-value
|
||
decreases rapidly. For higher noise levels, the optimal cut-off value αequiv. is
|
||
smaller, but since it is very unlikely that one knows the noise level, it is hard
|
||
to take advantage of this fact by tuning αequiv. to a specific noise level. The
|
||
overall best result will be obtained for αequiv. ≈ 0 for noise levels not exceeding
|
||
0.05.
|
||
It should be noted that for a noise level of 0.01 at αequiv., out of the 53% of
derived regular expressions that are equivalent to the target expression, about
7% are not covered by the sample. The latter illustrates once more the
generalization ability of the algorithm RWR², as was discussed in Section 7.3.
|
||
7.5 Performance
|
||
As mentioned previously, the one advantage RWR has over RWR2 is that the
|
||
former’s running time is much lower than the latter’s. This is illustrated in
|
||
Table V(a) for 1000 target expressions of alphabet size 10. It also shows the
|
||
relative running time for RWR0 , illustrating that RWR outperforms both RWR0 and
|
||
Table V.

(a)
Algorithm      relative running time
RWR0           6 · 10^2
RWR², ℓ = 1    2 · 10^2
RWR², ℓ = 2    2 · 10^3
RWR², ℓ = 3    1 · 10^4
RWR², ℓ = 4    4 · 10^4
RWR², ℓ = 5    1 · 10^5

(b)
|Σ|    time (ms)
5      2
10     5
15     15
20     33
50     616
100    7562

(a) Relative running times of RWR² versus RWR for various values of ℓ.
(b) Average running times in milliseconds for RWR as a function of alphabet size.
|
||
|
||
RWR² for any value of ℓ. However, it is interesting to note that RWR²₁ outperforms
RWR0 by a factor of 3, and derives more specific regular expressions, again
illustrating the superiority of the new algorithms over RWR0.
|
||
|
||
The performance of RWR is excellent: on average it takes only 5 ms to derive
|
||
an expression of alphabet size 10. Table V(b) shows actual running times as a
|
||
function of the target expressions’ alphabet size, averaged over 1000 random
|
||
expressions of that alphabet size.
|
||
With respect to the performance in terms of the number of examples, we
|
||
showed in the conference version of this article that the performance of RWR0 was adequate to
|
||
deal with large datasets. Example4 with 61 symbols in Table II is derived from
|
||
10000 example words in 7 seconds while CRX only needs 3.2 seconds. More
|
||
typical expressions of about 10 symbols derived from a few hundred examples
|
||
take approximately a second. These figures include the time to initialize a
|
||
Java Virtual Machine while the tests are done on a 2.5 GHz P4 with 1GB
|
||
of RAM. Given that RWR and RWR²₁ outperform RWR0 and the time required to
|
||
start the virtual machine and parse the data is independent of the algorithm,
|
||
our new algorithms are adequate as well. For instance, RWR derived a DTD
|
||
for PubMed from 10000 articles with a total size of over 1.2GB in 264 seconds
|
||
(again including the time needed for Java initialization and parsing of the XML
|
||
data). Trang slightly outperforms CRX thanks to very efficient XML parsing. We
|
||
did not make a detailed comparison with XTRACT for the reason that XTRACT
|
||
cannot handle samples with more than 1000 words.
|
||
8. EXTENSIONS
|
||
Incremental computation. Especially in the setting of sparse data when over
|
||
time more XML data gets generated, for instance, by answers to queries or
|
||
results of calls to Web services, it is desirable to update an already generated
|
||
schema based on the newly arrived XML data only. Such an approach is possible
|
||
for both RWR and CRX: as both algorithms make use of an internal representation
|
||
(automata or partial orders), we only need to update that representation. So, for
|
||
every element name we store the corresponding internal graph representation,
|
||
which is only quadratic in the number of different element names, and we can
|
||
forget about the XML data that generated it. Actually, for CRX, to assign the
|
||
qualifiers ?, + and ∗, we also need to remember for each element name how
|
||
it occurs (always exactly once, always more than once, . . . ), but this is only a
|
||
constant amount of information.
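The incremental scheme just described might look as follows in outline. The class and attribute names are hypothetical, the graph is kept as a plain edge set with assumed `<src>` and `<sink>` states, and the per-element occurrence information is reduced to the set of capped counts (0, 1, or 2 for "more than once") seen so far.

```python
SRC, SINK = "<src>", "<sink>"  # hypothetical initial and final states


class ElementSummary:
    """Internal representation kept per element name; the XML data itself is not stored."""

    def __init__(self):
        self.edges = set()      # at most quadratic in the number of element names
        self.occurrences = {}   # symbol -> set of capped counts seen, subset of {0, 1, 2}
        self.words_seen = 0

    def update(self, word):
        """Fold one newly arrived child sequence into the summary."""
        prev = SRC
        for symbol in word:
            self.edges.add((prev, symbol))
            prev = symbol
        self.edges.add((prev, SINK))
        for symbol in set(word).difference(self.occurrences):
            # A symbol first seen now was absent (count 0) from all earlier words.
            self.occurrences[symbol] = {0} if self.words_seen else set()
        for symbol, seen in self.occurrences.items():
            seen.add(min(word.count(symbol), 2))  # cap: 0, 1, or more than once
        self.words_seen += 1


summary = ElementSummary()
for word in ["ab", "abb", "a"]:
    summary.update(word)
print(sorted(summary.edges), summary.occurrences)
```

After these three updates the summary records that a always occurs exactly once, while b occurs zero times, once, or more than once, which is the kind of information needed to choose a qualifier.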
|
||
Numerical predicates. An immediate drawback of SOREs is that they cannot count. For instance, they cannot express aabb+ specifying that a string
|
||
should start with two a’s followed by any number of b’s larger than 1. XML
|
||
Schema even uses dedicated attributes for expressing the desired number of
|
||
repetitions.
|
||
<xs:sequence>
  <xs:element name="a" minOccurs="2" maxOccurs="2"/>
  <xs:element name="b" minOccurs="2" maxOccurs="unbounded"/>
</xs:sequence>
|
||
|
||
In the same way, REs can be extended by numerical predicates: when $r$ is
an RE and $i$ is a natural number then $r^{\geq i}$ and $r^{=i}$ are also REs. They are
semantically equivalent to $r^i r^*$ and $r^i$, respectively, where
$r^i = r \cdot r \cdots r$ ($i$ times). The preceding expression can then be
expressed as $a^{=2} b^{\geq 2}$. To both RWR and CRX a post-processing step can be
added that rewrites + and ∗ to numerical values based on exact occurrences of
element names in the XML data.
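One possible shape of such a post-processing step is sketched below. It is a rough heuristic, not the article's procedure: it counts the total occurrences of each element name per word (which is only meaningful when a name occurs in a single contiguous block), emits an exact predicate when the count never varies, and a lower bound otherwise. The function name and the textual output format are assumptions.

```python
def numerical_predicates(sample):
    """Map each symbol to a numerical predicate derived from its occurrence counts."""
    symbols = sorted({s for word in sample for s in word})
    predicates = {}
    for s in symbols:
        counts = [word.count(s) for word in sample]
        lo, hi = min(counts), max(counts)
        # Exact count when it never varies, otherwise only a lower bound.
        predicates[s] = f"{s}^(={lo})" if lo == hi else f"{s}^(>={lo})"
    return predicates


# Every word starts with exactly two a's followed by at least two b's.
print(numerical_predicates(["aabb", "aabbb", "aabbbb"]))
```

Choosing a lower bound instead of a bounded range whenever the counts differ is a generalization decision; other choices are equally defensible.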
|
||
Generation of XSDs. While the inference of DTDs essentially reduces to the
|
||
inference of regular expressions from sets of sample words (as illustrated in
|
||
Section 1.1), the inference of XSDs is much more complex.
|
||
Indeed, first and foremost, the content model of an element can only depend
|
||
on the element’s name in a DTD. XML Schema, in contrast, has a typing
|
||
mechanism that allows the content model of an element to depend not only on
|
||
its name, but also on the context in which it is used. We refer the interested
|
||
reader to Martens et al. [2006, 2007] for an in-depth discussion on the XML
|
||
Schema typing mechanism and the extra expressive power that it provides with
|
||
respect to DTDs. It is important to note, however, that the study of Martens
|
||
et al. [2006] also shows that 85% of XSDs in practice do not use this additional
|
||
power, and are hence structurally equivalent to a DTD. Obviously, inferring
|
||
such XSDs is merely a matter of using the correct syntax. How to extend
|
||
schema inference to deal with real XSDs that do use the additional power of
|
||
the XML Schema typing system is studied in a companion article [Bex et al.
|
||
2007].
|
||
Second, DTDs have essentially only one atomic data type to describe the
|
||
textual data found in XML documents: #PCDATA. XML Schema, in contrast, has
|
||
atomic data types for numbers, strings, dates, etc. The algorithms described
|
||
here can easily be extended with heuristics to recognize these atomic data
|
||
types, such as the ones described by Hegewald et al. [2006].
|
||
Inference of k-OREs. As the vast majority of expressions used in practical
|
||
schemas are SOREs, we focused in this article on the inference of SOREs. In
|
||
a companion article [Bex et al. 2008] we study the derivation of k-OREs, for
|
||
small values of k, thus covering virtually all expressions occurring in practice.
|
||
9. CONCLUSION
|
||
We introduced novel algorithms for the inference of concise regular expressions
from positive data. For the inference of SOREs, RWR² was shown to yield the best
experimental results. It is also quite robust when presented with incomplete
and noisy data. On real-world and synthetic datasets, the quality of the
inferred expressions exceeds that of the expressions returned by XTRACT, while
CRX is similar to Trang. CRX’s generalization ability makes it well suited to
very small datasets. Further, RWR, RWR², and CRX by definition always infer
succinct expressions, which can easily be interpreted by humans. Of independent
interest, we introduced a new algorithm to transform automata into short,
readable regular expressions.
|
||
ELECTRONIC APPENDIX
|
||
The electronic appendix for this article can be accessed in the ACM Digital
|
||
Library.
|
||
ACKNOWLEDGMENTS
|
||
|
||
We thank the authors of Garofalakis et al. [2003] for making available
|
||
XTRACT’s source code, as well as Wouter Gelade for comments on a previous draft of this article.
|
||
REFERENCES
|
||
ABITEBOUL, S., BUNEMAN, P., AND SUCIU, D. 1999. Data on the Web. Morgan Kaufmann Publishers.
|
||
AHONEN, H. 1996. Generating grammars for structured documents using grammatical inference methods. Ph.D. thesis, Report A-1996-4. Department of Computer Science, University of
|
||
Helsinki.
|
||
ANGLUIN, D. AND SMITH, C. H. 1983. Inductive inference: Theory and methods. ACM Comput.
|
||
Surv. 15, 3, 237–269.
|
||
BARBOSA, D., MENDELZON, A. O., KEENLEYSIDE, J., AND LYONS, K. A. 2002. ToXgene: An extensible
|
||
template-based data generator for XML. In Proceedings of the 5th International Workshop on the
|
||
Web and Databases (WebDB 2002). 49–54.
|
||
BARBOSA, D., MIGNET, L., AND VELTRI, P. 2006. Studying the XML web: Gathering statistics from
|
||
an XML sample. World Wide Web 9, 2, 187–212.
|
||
BENEDIKT, M., FAN, W., AND GEERTS, F. 2008. XPath satisfiability in the presence of DTDs. J.
|
||
ACM 55, 2, 1–79.
|
||
BERNSTEIN, P. A. 2003. Applying model management to classical meta data problems. In Online
|
||
Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR’03).
|
||
BEX, G. J., GELADE, W., NEVEN, F., AND VANSUMMEREN, S. Learning deterministic regular expressions
|
||
for the inference of schemas from XML data. http://arxiv.org/abs/1004.2372.
|
||
BEX, G. J., GELADE, W., NEVEN, F., AND VANSUMMEREN, S. 2008. Learning deterministic regular
|
||
expressions for the inference of schemas from XML data. In Proceedings of the 17th International
|
||
Conference on World Wide Web (WWW’08). 825–834.
|
||
BEX, G. J., NEVEN, F., AND DEN BUSSCHE, J. V. 2004. DTDs versus XML Schema: A practical study.
|
||
In Proceedings of the International Workshop on Web and Database (WebDB). S. Amer-Yahia and
|
||
L. Gravano, Eds. 79–84.
|
||
BEX, G. J., NEVEN, F., SCHWENTICK, T., AND TUYLS, K. 2006. Inference of concise DTDs from XML
|
||
data. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB). U. Dayal, K.-Y.
|
||
Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds.
|
||
ACM, 115–126.
|
||
BEX, G. J., NEVEN, F., AND VANSUMMEREN, S. 2007. Inferring XML schema definitions from XML
|
||
data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07).
|
||
998–1009.
|
||
BRĀZMA, A. 1993. Efficient identification of regular expressions from representative examples. In
|
||
Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT’93). ACM
|
||
Press, 236–242.
|
||
BRÜGGEMANN-KLEIN, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2,
|
||
197–213.
|
||
BRÜGGEMANN-KLEIN, A. AND WOOD, D. 1998. One-Unambiguous regular languages. Inform. Comput. 140, 2, 229–253.
|
||
BUNEMAN, P., DAVIDSON, S. B., FERNANDEZ, M. F., AND SUCIU, D. 1997. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT’97).
|
||
Lecture Notes in Computer Science, vol. 1186. Springer, 336–350.
|
||
CARON, P. AND ZIADI, D. 2000. Characterization of Glushkov automata. Theor. Comput. Sci. 233, 1–
|
||
2, 75–90.
|
||
Castor. The Castor project. www.castor.org.
|
||
CHIDLOVSKII, B. 2001. Schema extraction from XML: A grammatical inference approach. In
|
||
Proceedings of the 8th International Workshop on Knowledge Representation meets Databases
|
||
(KRDB’01). CEUR Workshop Proceedings, vol. 45.
|
||
CLARK, J. Trang: Multi-Format schema converter based on RELAX NG.
www.thaiopensource.com/relaxng/trang.html.
|
||
COVER, R. 2003. The Cover Pages. xml.coverpages.org.
|
||
DELGADO, M. AND MORAIS, J. 2004. Approximation to the smallest regular expression for a given
|
||
regular language. In Proceedings of the 9th International Conference on Implementation and
|
||
Application of Automata. Lecture Notes in Computer Science, vol. 3317. Springer, 312–314.
|
||
DEUTSCH, A., FERNANDEZ, M. F., AND SUCIU, D. 1999. Storing semistructured data with STORED.
|
||
In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM
|
||
Press, 431–442.
|
||
EHRENFEUCHT, A. AND ZEIGER, P. 1976. Complexity measures for regular expressions. J. Comput.
|
||
Syst. Sci. 12, 134–146.
|
||
FERNANDEZ, M. F. AND SUCIU, D. 1998. Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 14–23.
FERNAU, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications. Lecture Notes in Artificial Intelligence, vol. 3264. Springer, 277–278.
FERNAU, H. 2009. Algorithms for learning regular expressions from positive data. Inform. Comput. 207, 4, 521–541.
FLORESCU, D. 2005. Managing semi-structured data. ACM Queue 3, 8, 18–24.
GARCÍA, P. AND VIDAL, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920–925.
GAROFALAKIS, M., GIONIS, A., RASTOGI, R., SESHADRI, S., AND SHIM, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Mining Knowl. Discov. 7, 23–56.
GELADE, W. AND NEVEN, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS’08). Dagstuhl Seminar Proceedings, vol. 08001. 325–336.
GOLD, E. 1967. Language identification in the limit. Inform. Control 10, 5, 447–474.
GOLDMAN, R. AND WIDOM, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97). 436–445.
GRUBER, H. AND HOLZER, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5126. Springer, 39–50.
HAN, Y.-S. AND WOOD, D. 2007. Obtaining shorter regular expressions from finite-state automata. Theor. Comput. Sci. 370, 1–3, 110–120.
HEGEWALD, J., NAUMANN, F., AND WEIS, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06). IEEE Computer Society, 81–97.
HINKELMAN, S. 2005. Business integration—Information conformance statements (BI-ICS). Tech.
rep., IBM DeveloperWorks.
HOPCROFT, J. AND ULLMAN, J. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley.
HUET, G. 1980. Confluent reductions: Abstract properties and applications to term rewriting systems. J. ACM 27, 4, 797–821.
KOCH, C., SCHERZINGER, S., SCHWEIKARDT, N., AND STEGMAIER, B. 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB’04). 228–239.
MANOLESCU, I., FLORESCU, D., AND KOSSMANN, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01). 241–250.
MARTENS, W., NEVEN, F., AND SCHWENTICK, T. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15–22.
MARTENS, W., NEVEN, F., SCHWENTICK, T., AND BEX, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Data. Syst. 31, 3.
MCHUGH, J., ABITEBOUL, S., GOLDMAN, R., QUASS, D., AND WIDOM, J. 1997. Lore: A database management system for semistructured data. SIGMOD Rec. 26, 3, 54–66.
MELNIK, S. 2004. Generic model management: Concepts and algorithms. Ph.D. thesis, University of Leipzig.
MIGNET, L., BARBOSA, D., AND VELTRI, P. 2003. The XML web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500–510.
MIKLAU, G. 2002. XMLData repository. www.cs.washington.edu/research/xmldatasets.
MIN, J.-K., AHN, J.-Y., AND CHUNG, C.-W. 2003. Efficient extraction of schemas for XML documents. Inform. Process. Lett. 85, 1, 7–12.
NESTOROV, S., ABITEBOUL, S., AND MOTWANI, R. 1998. Extracting schema from semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 295–306.
NESTOROV, S., ULLMAN, J. D., WIENER, J. L., AND CHAWATHE, S. S. 1997. Representative objects: Concise representations of semistructured, hierarchical data. In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79–90.
NEVEN, F. AND SCHWENTICK, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logical Methods Comput. Sci. 2, 3.
NGU, A. H. H., ROCCO, D., CRITCHLOW, T., AND BUTTLER, D. 2005. Automatic discovery and inferencing of complex bioinformatics web interfaces. World Wide Web 8, 4, 463–493.
OAKS, P. AND TER HOFSTEDE, A. H. M. 2007. Guided interaction: A mechanism to enable ad hoc service interaction. Inform. Syst. Frontiers 9, 1, 29–51.
OHLEBUSCH, E. 2001. Implementing conditional term rewriting by graph rewriting. Theor. Comput. Sci. 262, 1, 311–331.
OPEN WEB APPLICATION SECURITY PROJECT CONSORTIUM. 2004. The top ten most critical web application security vulnerabilities—2004 update. www.owasp.org.
PITT, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference (AII’89). Springer-Verlag, 18–44.
RABINER, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257–286.
RAHM, E. AND BERNSTEIN, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334–350.
SAHUGUET, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract). In Proceedings of the 3rd International Workshop on the World Wide Web and Databases (WebDB’00), Selected Papers. 171–183.
SAKAKIBARA, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15–45.
SANKEY, J. AND WONG, R. K. 2001. Structural inference for semistructured data. In Proceedings of the International Conference on Information and Knowledge Management. ACM Press, 159–166.
Sun. Sun JAXB. java.sun.com/webservices/jaxb.
THOMPSON, H. S., BEECH, D., MALONEY, M., AND MENDELSOHN, N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium, Recommendation REC-xmlschema-1-20041028.
W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language, 2nd Ed. W3C.
WANG, G., LIU, M., YU, J. X., SUN, B., YU, G., LV, J., AND LU, H. 2003. Effective schema-based XML query optimization techniques. In Proceedings of the 7th International Database Engineering and Applications Symposium. 230–235.
Received January 2009; revised July 2009; accepted November 2009
ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.