Inference of Concise Regular Expressions and DTDs
GEERT JAN BEX and FRANK NEVEN
Hasselt University and Transnational University of Limburg
THOMAS SCHWENTICK
Dortmund University
and
STIJN VANSUMMEREN
Université Libre de Bruxelles
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that basically reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
This research was done while S. Vansummeren was a Postdoctoral Fellow of the Research Foundation-Flanders (FWO) at Hasselt University.
This work was funded by FWO-G.0821.09N and the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.
Authors’ addresses: G. J. Bex and F. Neven, Database and Theoretical Computer Science Research Group, Hasselt University and Transnational University of Limburg, Agoralaan, gebouw D, B-3590 Diepenbeek, Belgium; email: {geertjan.bex, frank.neven}@uhasselt.be; T. Schwentick, TU Dortmund, Fakultät für Informatik, Otto-Hahn-Str. 16, Raum 214, 44227 Dortmund, Germany; email: thomas.schwentick@udo.edu; S. Vansummeren, Research Laboratory for Web and Information Technologies (WIT), Université Libre de Bruxelles, 50 Av. F. Roosevelt, CP 165/15 B-1050 Brussels, Belgium; email: stijn.vansummeren@ulb.ac.be.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2010 ACM 0362-5915/2010/04-ART11 $10.00
DOI 10.1145/1735886.1735890 http://doi.acm.org/10.1145/1735886.1735890
ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.
G. J. Bex et al.
Categories and Subject Descriptors: F.4.3 [Mathematical Logic and Formal Languages]: Formal Languages; H.2.1 [Database Management]: Logical Design; I.2.6 [Artificial Intelligence]: Learning; I.7.2 [Document and Text Processing]: Document Preparation
General Terms: Algorithms, Languages, Theory
Additional Key Words and Phrases: Regular expressions, schema inference, XML
ACM Reference Format:
Bex, G. J., Neven, F., Schwentick, T., and Vansummeren, S. 2010. Inference of concise regular expressions and DTDs. ACM Trans. Datab. Syst. 35, 2, Article 11 (April 2010), 47 pages.
DOI = 10.1145/1735886.1735890 http://doi.acm.org/10.1145/1735886.1735890
1. INTRODUCTION
The eXtensible Markup Language (XML) serves as the lingua franca for data exchange on the Internet [Abiteboul et al. 1999]. Because XML documents in general can be of any form, most communities and applications impose structural constraints on the documents that are to be exchanged or processed. These constraints can be formally specified in a schema, which is written in a schema language such as Document Type Definitions (DTDs) or XML Schema Definitions (XSDs) [Thompson et al. 2004].
The advantages offered by the presence of a fully specified schema are numerous. First and foremost, a schema allows automatic validation of the input document structure, which not only facilitates automatic processing but also ensures soundness of the input. Unvalidated input data from Web requests is considered the number one vulnerability for Web applications [Open Web Application Security Project Consortium 2004]. The presence of a schema also allows for automation and optimization of search, integration, and processing of XML data (refer to, e.g., Benedikt et al. [2008], Deutsch et al. [1999], Koch et al. [2004], Manolescu et al. [2001], Neven and Schwentick [2006], Wang et al. [2003]). Moreover, various software development tools such as Castor [Castor] and SUN’s JAXB [Sun] rely on schemas to perform object-relational mappings for persistence. Furthermore, the existence of schemas is imperative when integrating (meta) data through schema matching [Rahm and Bernstein 2001] and in the area of generic model management [Bernstein 2003; Melnik 2004]. A final advantage of a schema is that it assigns meaning to the data. That is, it provides a user with a concrete semantics of the document and aids in the specification of meaningful queries over XML data. Although the examples mentioned here just scratch the surface of current applications, they already underscore the importance of schemas accompanying XML data.
Unfortunately, in spite of the aforementioned advantages, the presence of a schema is not mandatory and many XML documents are not accompanied by one. For instance, recent studies by Mignet et al. [2003] and Barbosa et al. [2006] have shown that approximately half of the XML documents available on the Web do not refer to a schema. In other studies, Bex et al. [2004] and Martens et al. [2006] have noted that about two-thirds of XSDs gathered from schema repositories and from the Web are not valid with respect to the W3C XML Schema specification [Thompson et al. 2004], rendering them essentially useless for immediate application. A similar observation was made by Sahuguet [2000] concerning DTDs.
Given the lack of schemas in practice, it is essential to devise algorithms that can infer a schema for a given collection of XML documents when none, or no syntactically correct one, is present. This is also acknowledged by Florescu [2005] who emphasizes that in the context of data integration:
    “We need to extract good-quality schemas automatically from existing data and perform incremental maintenance of the generated schemas.”
In this article, we describe two novel schema inference algorithms outperforming existing systems in accuracy, conciseness, and speed.
It should be noted that even when a schema is already available, there are situations where inference can be useful. One such situation is schema cleaning: sometimes a schema is too general with respect to the XML data that it is supposed to describe. In that case, it can be advantageous to infer a new schema based solely on the data at hand. This situation is nicely illustrated by the following real-world example taken from the Protein Sequence Database DTD [Miklau 2002], which gives the following definition for the refinfo-element.
    authors, citation, volume?, month?, year, pages?, (title | description)?, xrefs?
An analysis of the available XML corpus (683MB of data) with our inference algorithms yields the following more precise expression for the refinfo-element.
    authors, citation, (volume | month), year, pages?, (title | description)?, xrefs?
Note that the latter is stricter than the former, as it emphasizes that volume and month do not occur together: either one specifies a month of publication for a given journal article, or the volume that it has appeared in, but not both. As this example illustrates, schema inference algorithms can hence be used to better understand the semantics of a given XML dataset, making it possible to adapt an existing schema when necessary. In general, schema inference can be used to restrict schemas to a relevant subset of data needed by the application at hand, thereby facilitating difficult tasks like schema matching and data integration. Indeed, as argued by Hinkelman [2005], industry-level standards are too loosely defined in general, which can result in XML schemas where many business structures are formally specified as being optional.
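The difference between the two refinfo content models can be checked mechanically. The following is only an illustrative sketch and not part of the inference algorithms: it encodes both expressions as Python regular expressions over element names, under the assumption that each child name is terminated by a `;`. The names `compile_model`, `accepts`, `GENERAL`, and `STRICT` are hypothetical helpers introduced for this example.

```python
import re

def compile_model(model):
    """Compile a DTD-style content model into a regex over ';'-terminated names."""
    # Wrap every element name in a group that also consumes its ';' terminator.
    pattern = re.sub(r"[A-Za-z_][\w-]*", lambda m: "(?:" + m.group(0) + ";)", model)
    # Concatenation is implicit in Python regexes, so commas and blanks are dropped.
    return re.compile(pattern.replace(",", "").replace(" ", ""))

def accepts(model, children):
    """Check whether a sequence of child element names belongs to the model."""
    word = "".join(name + ";" for name in children)
    return model.fullmatch(word) is not None

# The original DTD definition and the inferred, stricter one.
GENERAL = compile_model(
    "authors, citation, volume?, month?, year, pages?, (title | description)?, xrefs?")
STRICT = compile_model(
    "authors, citation, (volume | month), year, pages?, (title | description)?, xrefs?")

both = ["authors", "citation", "volume", "month", "year"]
only_volume = ["authors", "citation", "volume", "year", "pages"]

print(accepts(GENERAL, both), accepts(STRICT, both))
print(accepts(GENERAL, only_volume), accepts(STRICT, only_volume))
```

A word that lists both volume and month is accepted by the general expression only, whereas a word with just one of the two is accepted by both expressions.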
The second situation where schema inference is useful even though a schema already exists is in the presence of noisy XML data. In such a situation, part or all of the data that needs to be processed is rejected by the existing schema. For instance, we have harvested and investigated a corpus of XHTML documents from the Web and found that an astonishing 89% of 2092 documents were not valid with respect to the XHTML Transitional specification [W3C 2002]. In this case, the inference of a new schema based on the corpus and its comparison with the XHTML Transitional specification provides a uniform view of the kind of errors made. Further, given that one often has no choice but to deal with such noisy data, one may infer a new schema from a subset of the corpus (deleting documents that make unacceptable errors) and work with that schema rather than with the official specification to retain at least a minimal validation.
Fig. 1. An example DTD.
1.1 Problem Setting
Based on the previous observations, it is hence essential to devise algorithms that can automatically infer a DTD or XSD from a given corpus of XML documents.
As illustrated in Figure 1, a DTD is essentially a mapping d from element names to regular expressions over element names. An XML document is valid with respect to d if for every occurrence of an element name e in the document, the word formed by its children belongs to the language of the corresponding regular expression d(e). For instance, the DTD in Figure 1 requires each store element to have zero or more order children, which must be followed by a stock element. Likewise, each order must have a customer child, which must be followed by one or more item elements.
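The validity condition just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full DTD validator: the dictionary `D` contains only the two rules spelled out in the prose (the remaining rules of Figure 1 are not reproduced here), each child name is encoded with a trailing `;`, and elements without a rule are simply skipped. The names `D` and `is_valid` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Partial DTD d: element name -> regular expression over ';'-terminated child names.
# Only the rules described in the text are listed (assumption).
D = {
    "store": re.compile(r"(?:order;)*(?:stock;)"),
    "order": re.compile(r"(?:customer;)(?:item;)+"),
}

def is_valid(root, dtd):
    """Check that, for every element e, the word of its children lies in L(d(e))."""
    for element in root.iter():
        rule = dtd.get(element.tag)
        if rule is None:
            continue  # assumption: elements without a rule are not checked
        word = "".join(child.tag + ";" for child in element)
        if rule.fullmatch(word) is None:
            return False
    return True

good = ET.fromstring(
    "<store><order><customer/><item/><item/></order><stock/></store>")
bad = ET.fromstring(
    "<store><stock/><order><customer/><item/></order></store>")

print(is_valid(good, D), is_valid(bad, D))
```

The second document is rejected because its stock element precedes the order element, so the child word of store is not in the language of d(store).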
To infer a DTD from a corpus of XML documents C it hence suffices to look, for each element name e that occurs in a document in C, at the set of element name words that occur below e in C, and to infer from this set the corresponding regular expression d(e). As such, the inference of DTDs reduces to the inference of regular expressions from sets of positive example words. To illustrate, from the words id price, id qty supplier, and id qty item item appearing under <item> elements in a sample XML corpus, we could derive the following rule.
    item → (id, price | (qty, (supplier | item+)))
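The reduction from DTD inference to regular expression inference starts by collecting, for each element name, the set of child words observed in the corpus. A minimal sketch of this collection step follows; the function name `collect_samples` and the tiny corpus are assumptions made for illustration, and the subsequent inference of d(e) from each sample is left out.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def collect_samples(documents):
    """Map every element name e to the set of child-name words occurring below e."""
    samples = defaultdict(set)
    for text in documents:
        for element in ET.fromstring(text).iter():
            # The word below an element is the sequence of its children's names.
            samples[element.tag].add(tuple(child.tag for child in element))
    return samples

# Hypothetical corpus reproducing the three words from the text.
corpus = [
    "<item><id/><price/></item>",
    "<item><id/><qty/><supplier/></item>",
    "<item><id/><qty/>"
    "<item><id/><price/></item><item><id/><price/></item>"
    "</item>",
]

samples = collect_samples(corpus)
for word in sorted(samples["item"]):
    print(" ".join(word))
```

Each set `samples[e]` is then the positive sample from which a regular expression for e would be inferred.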
While the inference of XSDs is more complicated than the inference of DTDs, recent characterizations [Martens et al. 2006] show that the structural core of XML Schema (that is, the sets of trees that are definable by XSDs) corresponds to DTDs extended with vertical regular expressions. Therefore, one cannot hope to successfully infer XSDs without good algorithms for inferring regular expressions. As such, we focus in this article on the inference of regular expressions (and therefore, by the preceding reduction, on the inference of DTDs). The inference of XSDs, building on the algorithms presented here, is treated in a companion article [Bex et al. 2007].
In particular, let Σ be a fixed set of alphabet symbols (also called element names), and let Σ* be the set of all words over Σ.
Definition 1 (Regular Expressions). In this article, we are interested in learning regular expressions r, s of the form
    r, s ::= ∅ | ε | a | r . s | r + s | r? | r^+,
where parentheses may be added to avoid ambiguity. Here, ε denotes the empty word; a ranges over symbols in Σ; r . s denotes concatenation; r + s denotes disjunction; r^+ denotes one-or-more repetitions; and r? denotes the optional regular expression. That is, the language L(r) accepted by regular expression r is given by
    L(∅) = ∅
    L(ε) = {ε}
    L(a) = {a}
    L(r . s) = {vw | v ∈ L(r), w ∈ L(s)}
    L(r + s) = L(r) ∪ L(s)
    L(r^+) = {v_1 ... v_n | n ≥ 1 and v_1, ..., v_n ∈ L(r)}
    L(r?) = L(r) ∪ {ε}.
For convenience, we sometimes omit the concatenation symbol, simply writing rs for r . s. Note that the Kleene star operator (denoting zero or more repetitions, as in r^*) is not allowed by the preceding syntax. This is not a restriction, since r^* can always be represented as (r^+)? or (r?)^+. Conversely, the latter can always be rewritten into the former for presentation to the user. Also note that the previous syntax uses r + s to denote disjunction rather than the vertical bar notation r | s used by DTDs. The former notation should not be confused with the one-or-more repetition operator r^+, where the plus symbol is used in the exponent.
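The semantics of Definition 1 can be made concrete with a small enumerator. The sketch below is an illustration only: it assumes a hypothetical encoding of expressions as nested tuples (`("empty",)`, `("eps",)`, `("sym", a)`, `("cat", r, s)`, `("alt", r, s)`, `("opt", r)`, `("plus", r)`) and computes the words of L(r) up to a length bound n. It also checks, up to that bound, that the two encodings of the Kleene star mentioned above denote the same language.

```python
def language(r, n):
    """Return all words of L(r) of length at most n, as tuples of symbols."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {()}
    if kind == "sym":
        return {(r[1],)} if n >= 1 else set()
    if kind == "cat":
        left, right = language(r[1], n), language(r[2], n)
        return {v + w for v in left for w in right if len(v + w) <= n}
    if kind == "alt":
        return language(r[1], n) | language(r[2], n)
    if kind == "opt":
        return language(r[1], n) | {()}
    if kind == "plus":
        base = language(r[1], n)
        result, frontier = set(base), set(base)
        # Keep appending words of L(r) until no new word within the bound appears.
        while frontier:
            frontier = {v + w for v in frontier for w in base
                        if len(v + w) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")

a = ("sym", "a")
star_1 = ("opt", ("plus", a))   # (a^+)?
star_2 = ("plus", ("opt", a))   # (a?)^+

print(language(star_1, 3) == language(star_2, 3))
```

Bounding the word length keeps the enumeration finite; agreement up to a bound is of course only evidence of equivalence, not a proof of it.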
The class of all regular expressions is actually too large for our purposes, as both DTDs and XSDs require the regular expressions occurring in them to be deterministic (also sometimes called one-unambiguous [Brüggemann-Klein and Wood 1998]). Intuitively, a regular expression is deterministic if, without looking ahead in the input word, it allows each symbol of that word to be matched uniquely against a position in the expression when processing the input in one pass from left to right. For instance, (a + b)^* a is not deterministic as already the first symbol in the word aaa could be matched by either the first or the second a in the expression. Without lookahead, it is impossible to know which one to choose. The equivalent expression b^* a (b^* a)^*, on the other hand, is deterministic.
Definition 2. Let r̄ stand for the regular expression obtained from r by replacing the ith occurrence of alphabet symbol a in r by a^(i), for every i and a. For example, for r = b^+ a (b a^+)? we have r̄ = (b^(1))^+ a^(1) (b^(2) (a^(2))^+)?. A regular expression r is deterministic if there are no words w a^(i) v and w a^(j) v′ in L(r̄) such that i ≠ j.
Equivalently, an expression is deterministic if the so-called Glushkov construction [Brüggemann-Klein 1993] translates it into a deterministic finite automaton rather than a nondeterministic one [Brüggemann-Klein and Wood 1998]. Not every nondeterministic regular expression is equivalent to a deterministic one [Brüggemann-Klein and Wood 1998]. Thus, semantically, the class of deterministic regular expressions forms a strict subclass of the class of all regular expressions.
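The Glushkov-based characterization suggests a simple determinism test: compute the first and follow sets of the marked expression and check that no set contains two distinct positions carrying the same symbol. The sketch below is one possible rendering of this standard construction, not code from the article. It assumes the same hypothetical tuple encoding of expressions as nested `("sym", a)`, `("cat", r, s)`, `("alt", r, s)`, `("opt", r)`, `("plus", r)`, and `("eps",)` nodes, assumes that ∅ does not occur, and numbers positions globally rather than per symbol, which does not affect the test.

```python
from itertools import count

def glushkov(r):
    """Return (first, follow) of r; positions are (symbol, id) pairs."""
    ids = count(1)
    follow = {}

    def visit(e):
        # Returns (nullable, first positions, last positions) of subexpression e.
        kind = e[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            p = (e[1], next(ids))
            follow[p] = set()
            return False, {p}, {p}
        if kind == "cat":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            for p in l1:
                follow[p] |= f2
            return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())
        if kind == "alt":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "opt":
            _, f, l = visit(e[1])
            return True, f, l
        if kind == "plus":
            n, f, l = visit(e[1])
            for p in l:
                follow[p] |= f  # a repetition may restart after any last position
            return n, f, l
        raise ValueError(f"unsupported expression kind: {kind}")

    _, first, _ = visit(r)
    return first, follow

def is_deterministic(r):
    """True if no first or follow set holds two positions with the same symbol."""
    first, follow = glushkov(r)
    for positions in [first, *follow.values()]:
        symbols = [symbol for symbol, _ in positions]
        if len(symbols) != len(set(symbols)):
            return False
    return True

def star(r):
    return ("opt", ("plus", r))  # r^* encoded as (r^+)?

a, b = ("sym", "a"), ("sym", "b")
e1 = ("cat", star(("alt", a, b)), a)                       # (a + b)^* a
e2 = ("cat", ("cat", star(b), a), star(("cat", star(b), a)))  # b^* a (b^* a)^*

print(is_deterministic(e1), is_deterministic(e2))
```

On the two expressions discussed above, the test reports that (a + b)^* a is not deterministic while b^* a (b^* a)^* is.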
Learning in the limit. For the purpose of inferring DTDs from XML data, we are hence in search of an algorithm that, given enough sample words of a target deterministic regular expression r, returns a deterministic expression r′ equivalent to r. In the framework of learning in the limit [Gold 1967], such an algorithm is said to learn the deterministic regular expressions from positive data.
Definition 3. Define a sample to be a finite subset of Σ* and let R be a subclass of the regular expressions. An algorithm M mapping samples to expressions in R is said to learn R from positive data if: (1) S ⊆ L(M(S)) for every sample S and (2) to every r ∈ R we can associate a so-called characteristic sample S_r ⊆ L(r) such that, for each sample S with S_r ⊆ S ⊆ L(r), M(S) is equivalent to r.
Intuitively, the first condition says that M must be sound; the second that M must be complete, given enough data. A class of regular expressions R is learnable in the limit from positive data if an algorithm exists that learns R. For the class of all regular expressions, it was shown by Gold [1967] that no such algorithm exists. The same holds for the class of deterministic regular expressions, as shown in our companion article [Bex et al. 2008].
PROPOSITION 4 (BEX ET AL. 2008). The class of deterministic regular expressions is not learnable in the limit from positive data.
Proposition 4 immediately excludes the possibility for an algorithm to infer the full class of DTDs. In practice, however, regular expressions occurring in DTDs and XSDs are concise rather than arbitrarily complex. Indeed, a study of 819 DTDs and XSDs gathered from the Cover Pages [Cover 2003] (including many high-quality XML standards) as well as from the Web at large, revealed that regular expressions occurring in practical schemas are such that every alphabet symbol occurs at most k times, with k small. Actually, in 98% of the cases k = 1.
Definition 5. A regular expression is k-occurrence if every alphabet symbol occurs at most k times in it.
For example, the expressions customer . order^+ and (school + institute)^+ are both 1-occurrence, while id . (qty + id) is 2-occurrence (as id occurs twice). Observe that if r is k-occurrence, then it is also l-occurrence for every l ≥ k. To simplify notation, we often abbreviate “k-occurrence regular expression” by k-ORE and also refer to the 1-OREs as “single occurrence regular expressions” or SOREs.
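Whether an expression is k-occurrence can be read off by counting symbol occurrences in its syntax tree. The following minimal sketch again assumes a hypothetical nested-tuple encoding of expressions, with leaves `("sym", a)` and operator nodes such as `("cat", r, s)`, `("alt", r, s)`, `("opt", r)`, and `("plus", r)`; the function names `occurrences`, `min_k`, and `is_sore` are illustrative.

```python
from collections import Counter

def occurrences(r):
    """Count how often each alphabet symbol occurs in expression r."""
    if r[0] == "sym":
        return Counter([r[1]])
    total = Counter()
    for sub in r[1:]:
        if isinstance(sub, tuple):
            total += occurrences(sub)
    return total

def min_k(r):
    """Smallest k such that r is k-occurrence (0 if r contains no symbol)."""
    return max(occurrences(r).values(), default=0)

def is_sore(r):
    """A SORE is a 1-occurrence regular expression."""
    return min_k(r) <= 1

def sym(name):
    return ("sym", name)

r1 = ("cat", sym("customer"), ("plus", sym("order")))        # customer . order^+
r2 = ("plus", ("alt", sym("school"), sym("institute")))      # (school + institute)^+
r3 = ("cat", sym("id"), ("alt", sym("qty"), sym("id")))      # id . (qty + id)

print(min_k(r1), min_k(r2), min_k(r3))
```

The first two expressions from the example are SOREs, while the third is 2-occurrence because id occurs twice.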
Note that, since every alphabet symbol can occur at most once in a SORE, every SORE is necessarily deterministic. Indeed, we have the following strict inclusion hierarchy among the various classes of regular expressions just discussed.
    SOREs ⊂ 2-OREs ⊂ 3-OREs ⊂ ··· ⊂ k-OREs ⊂ all regex
    SOREs ⊂ deterministic regex ⊂ all regex
(For k ≥ 2, the classes of k-OREs and deterministic regular expressions are incomparable.) Given their importance in practical schemas, we focus in this article on the inference of SOREs. The inference of deterministic k-OREs for k > 1 is treated in a companion article [Bex et al. 2008].
1.2 Outline and Contributions
In particular, we show in Section 3 that the class of SOREs can be efficiently learned in the limit from positive data by first constructing an automaton representation of the target SORE using techniques of García and Vidal [1990], and by subsequently transforming this automaton into an equivalent SORE (if such a SORE exists) using a novel polynomial-time algorithm called REWRITE. For the general class of regular expressions the resulting expression can be of exponential size, as we explain in more detail in Section 3. In Section 4, we improve REWRITE to deal with real-world, and therefore incomplete, samples. In contrast to REWRITE, which fails when its input automaton is not equivalent to a SORE, the resulting improvement, called RWR, repairs the input automaton until it becomes equivalent to a SORE. We also develop an extension of RWR, called RWR2, which improves the precision of RWR at the cost of increased running time.
For the settings where extremely little XML data is available to infer a schema from (for instance, when the data is returned as answers to queries or Web service requests [Ngu et al. 2005; Oaks and ter Hofstede 2007]), we introduce in Section 6 the algorithm CRX. CRX successfully learns the class of CHAREs, a strict subclass of the SOREs that nevertheless holds great practical importance. Indeed, the same investigation as before reveals that more than 90% of the regular expressions occurring in practical schemas are CHAREs [Martens et al. 2006].
We experimentally validate RWR, RWR2, and CRX in Section 7 on both small and large samples drawn from real-world target DTDs whose regular expressions fall both within the class of SOREs/CHAREs and outside of those classes. In all settings, our algorithms outperform existing systems in accuracy, conciseness, and speed. Further, we assess the strong generalization ability of CRX by establishing on average the minimal number of sample words needed to derive optimal regular expressions. In Section 8 we discuss how to extend RWR and CRX to incrementally compute the inferred regular expressions when new data arrive, how to address noise, and how to deal with numerical predicates. We begin in the next section with a discussion of related work, and conclude in Section 9.
It is important to note that this article differs from its conference version [Bex et al. 2006] in the following ways.
—First and foremost, it corrects the results of Bex et al. [2006] by providing a completely new algorithm for converting automata into equivalent SOREs (provided such a SORE exists), and gives a full correctness proof (Section 3). In contrast to what is claimed in Bex et al. [2006], the conversion algorithm of Bex et al. [2006] does not always yield an equivalent SORE, as discussed in Section 5.
—It introduces new heuristics (based on a language size criterion) for dealing with real-world, and therefore incomplete, datasets (Section 4).
—It adds new experiments that measure: (1) the impact of noise and (2) the accuracy of our algorithms under various levels of missing data.
2. RELATED WORK
Schema inference. Schemas for semistructured data have been defined in Buneman et al. [1997], Fernandez and Suciu [1998], and McHugh et al. [1997], and their inference has been addressed in Goldman and Widom [1997] and Nestorov et al. [1997, 1998]. The methods in Nestorov et al. [1997] and Goldman and Widom [1997] focus on the derivation of a graph summary structure (called full representative object or dataguide) for a semistructured database. This data structure contains all paths in the database. Approximations of this structure are considered by restricting to paths of a certain length. The latter then basically reduces to the derivation of an automaton from a set of bounded length strings. Naively restricting the algorithms to trees rather than graphs is inappropriate since no order is considered between the children of a node so that DTD-like schemas cannot be derived. However, even the use of more sophisticated encodings of the XML documents using edges between siblings would be to no avail since no algorithms are given to translate the obtained automata to regular expressions. In Nestorov et al. [1998], a schema is a typing by means of a datalog program. Again, no algorithms are given to transform datalog types into regular expressions. These approaches can therefore not be used to derive DTDs, not even when the semistructured database is tree-shaped.
DTD inference. In the context of DTD inference, Sankey and Wong [2001] propose several approaches to generate probabilistic string automata to represent regular expressions. To transform these into actual regular expressions, and hence to obtain DTDs, the authors refer to the methods of Ahonen [1996]. The latter provides a method to translate one-unambiguous nonprobabilistic string automata to regular expressions, as given by Brüggemann-Klein and Wood [1998], followed by a post-processing simplification step. Apart from several case analyses based on a dictionary example, no systematic study of the effectiveness of the approach is provided. In particular, in contrast to our results, no target class is given for which the set of transformations is complete.
There are only a few papers describing systems for direct DTD inference [Garofalakis et al. 2003; Min et al. 2003; Chidlovskii 2001]. Only one of them is available for testing: XTRACT [Garofalakis et al. 2003]. In Section 7, we make a detailed comparison with our proposal. In contrast to our approach, the XTRACT system generates for every separate string a regular expression while representing repeated subparts by introducing Kleene-*. In a second step, the system factorizes common subexpressions of these candidate regular expressions using algorithms from the logic optimization literature. Finally, in the third step, XTRACT applies the Minimum Description Length (MDL) principle to find the best RE among the candidates. Although the approach has been shown to work on real-world DTDs in Garofalakis et al. [2003], the XML data complying with these DTDs was generated. We report in Section 7 that XTRACT has two kinds of shortcomings on real-world XML data: (1) it generates large, long-winded, and difficult-to-interpret regular expressions; and (2) it cannot handle large datasets (over 1000 strings). The latter is due to the NP-hard submodule in the third step of the XTRACT algorithm [Fernau 2004]. The former problem seems to be more fundamental. The final step results in expressions consisting of disjunctions of regular expressions while in practice the large majority of regular expressions are concatenations of disjunctions [Martens et al. 2006]. As a result, larger datasets result in larger regular expressions.
In Min et al. [2003] an adaptation of the XTRACT approach to a restricted class of regular expressions which form a subclass of SOREs is described. Although the system, according to the experiments conducted in Min et al. [2003], outperforms XTRACT in accuracy and efficiency, it seems that the two fundamental shortcomings described earlier remain. It would thus be surprising if the system performed much better than XTRACT on real-world data. Similarly to Ahonen [1996], the approach of Chidlovskii [2001] relies on the translation of Glushkov automata to regular expressions which, in general, can lead to an exponential size increase.
Trang [Clark] is state-of-the-art software written by James Clark intended as a schema translator for the schema languages DTDs, Relax NG, and XML Schema. In addition, Trang can infer a schema for a given set of XML documents. We discuss Trang further in Section 7.1.
Language inference. Learning of regular languages from positive examples in
|
|||
|
|
the computational learning community is mostly directed towards inference of
|
|||
|
|
automata as opposed to inference of regular expressions [Angluin and Smith
|
|||
|
|
1983; Pitt 1989; Sakakibara 1997]. As noted by Fernau [2004] and argued
|
|||
|
|
in the previous section, first using learning algorithms for deterministic automata and then transforming these into regular expressions in general leads
|
|||
|
|
to unmanageable and long-winded regular expressions. Some approaches to
|
|||
|
|
inference of regular expressions for restricted cases have been considered. For
|
|||
|
|
instance, Brāzma [1993] showed that regular expressions without union can
|
|||
|
|
be approximately learned in polynomial time from a set of examples satisfying
|
|||
|
|
some criteria. Fernau [2009] provided a learning algorithm for finite unions
|
|||
|
|
of pairwise left-aligned union-free regular expressions. These expressions are
|
|||
|
|
different from the expressions we consider here: they are not included in the
|
|||
|
|
class of SOREs and do not contain all CHAREs. The development is purely
|
|||
|
|
theoretical; no experimental validation has been performed.
|
|||
|
|
Automata to RE translation. Although heuristics for automata to RE translations [Delgado and Morais 2004; Han and Wood 2007] have been proposed,
|
|||
|
|
ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.
|
|||
|
|
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 2. (a) The SOA accepting the same language as the SORE a . b . (c + d+). (b) The SOA generated
|
|||
|
|
by 2T-INF for the sample S = {bacacdacde, cbacdbacde, abccaadcde}.
|
|||
|
|
|
|||
|
|
all of them are optimizations of the classical state elimination algorithm. In
|
|||
|
|
particular, they investigate the best order to eliminate states when going from
|
|||
|
|
automata to regular expressions. So, they focus on the class of all automata
|
|||
|
|
for which, as explained in Section 3, an exponential increase in size cannot be
|
|||
|
|
avoided in general. Further, the methods remain theoretical as no experimental
|
|||
|
|
analysis has been performed. Caron and Ziadi [2000] devise an algorithm deciding whether an automaton is Glushkov. If so, the automaton can be rewritten
|
|||
|
|
into a short equivalent regular expression. Their method works in a top-down
|
|||
|
|
fashion, that is, it derives the top nodes of the parse tree corresponding to
|
|||
|
|
the regular expression first, and subsequently proceeds downward in the tree.
|
|||
|
|
Consequently, the method first derives the largest subexpressions of the expression, making it harder to devise heuristics in the presence of missing data.
|
|||
|
|
In contrast, our approach is bottom-up, that is, starting from the leaf nodes of
|
|||
|
|
the parse tree, composing them into the smallest subexpressions.
|
|||
|
|
3. A COMPLETE ALGORITHM FOR INFERRING SORES
|
|||
|
|
Our goal in this section is to infer a SORE s equivalent to a target SORE r
|
|||
|
|
given only a finite sample S ⊆ L(r). To this end, we first learn from S a Single
|
|||
|
|
Occurrence Automaton (SOA for short). A SOA is a specific kind of deterministic
|
|||
|
|
finite state automaton in which all states, except for the initial and final state,
|
|||
|
|
are element names. Figure 2(a) gives an example. Note that in contrast to the
|
|||
|
|
classical definition of automata, no edges are labeled: all incoming edges in a
|
|||
|
|
state a are assumed to be labeled by a. As such, a word a1 , . . . , an is accepted if
|
|||
|
|
there is an edge from the initial state to a1 , an edge from a1 to a2 ,. . . , and an
|
|||
|
|
edge from an to the final state. Thus, the SOA in Figure 2(a) accepts the same
|
|||
|
|
language as a . b .(c + d+ ).
|
|||
|
|
Definition 6 (SOA). Let src and sink be two special symbols, distinct from
|
|||
|
|
the element names, that will serve as the initial and final state, respectively. A
|
|||
|
|
single occurrence automaton is a finite directed graph G = (V, E) such that:
|
|||
|
|
(1) {src, sink} ⊆ V and all nodes in V − {src, sink} are element names; and
|
|||
|
|
(2) src has only outgoing edges; sink has only incoming edges; and every v ∈
|
|||
|
|
V − {src, sink} is visited during a walk from src to sink.
|
|||
|
|
Note that V − {src, sink} can be empty. We write L(G) for the set of all words
|
|||
|
|
accepted by G; V(G) for the set of G’s vertices, and E(G) for G’s edge relation.
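The acceptance condition for a SOA can be illustrated with a minimal sketch. It assumes a SOA is stored as a set of edge pairs and that element names are single characters, so a word is a Python string; the names `accepts` and `FIG_2A` are illustrative and do not come from the paper. The edge set for Figure 2(a) is read off from the SORE a . b . (c + d+).

```python
SRC, SINK = "src", "sink"

def accepts(edges, word):
    """Return True if the SOA with edge set `edges` accepts `word`.

    Since every edge into a state a is implicitly labeled a, reading a
    symbol simply means following the edge into the state of that name.
    """
    state = SRC
    for symbol in word:
        if (state, symbol) not in edges:
            return False
        state = symbol
    return (state, SINK) in edges

# Assumed edge set of the SOA in Figure 2(a).
FIG_2A = {(SRC, "a"), ("a", "b"), ("b", "c"), ("b", "d"),
          ("d", "d"), ("c", SINK), ("d", SINK)}

print(accepts(FIG_2A, "abc"), accepts(FIG_2A, "abcd"))
```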
|
|||
|
|
|
|||
|
|
|
|||
|
|
Algorithm 1. 2T-INF
Input: a finite set of sample strings S
Output: a SOA G such that S ⊆ L(G)
1: Let V be the set of states consisting of all element names occurring in S plus the initial state src and final state sink
2: Initialize E := ∅
3: for each string a1 . . . an in S do
4:     add the edges (src, a1), (a1, a2), . . . , (an, sink) to E
5: end for
6: return G = (V, E)
|
|||
|
|
|
|||
|
|
3.1 Learning an Automaton
|
|||
|
|
Given a sample S, we can learn an automaton G that accepts all words in S by
|
|||
|
|
means of the algorithm 2T-INF shown in Algorithm 1. Its behavior is illustrated
|
|||
|
|
in Figure 2(a) on the sample S = {abc, abdd} and in Figure 2(b) on the sample
|
|||
|
|
S = {bacacdacde, cbacdbacde, abccaadcde}. 2T-INF was introduced by Garcı́a and
|
|||
|
|
Vidal [1990], who also proved the following proposition.
|
|||
|
|
PROPOSITION 7 ([GARCÍA AND VIDAL 1990]). 2T-INF is sound, that is, S ⊆
|
|||
|
|
L(2T-INF(S)) for each sample S. Moreover, 2T-INF is minimal, that is, for each SOA
|
|||
|
|
G with S ⊆ L(G), 2T-INF(S) is a subgraph of G and hence L(2T-INF(S)) ⊆ L(G).
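Algorithm 1 translates almost line by line into code. The following is a sketch under the assumption that words are strings of single-character element names and that a SOA is returned as a pair of Python sets; the function name `two_t_inf` is illustrative.

```python
SRC, SINK = "src", "sink"

def two_t_inf(sample):
    """Build the SOA (V, E) of Algorithm 1 from an iterable of words."""
    vertices = {SRC, SINK}
    edges = set()
    for word in sample:
        vertices.update(word)
        # The path src, a1, ..., an, sink contributes one edge per step.
        path = [SRC, *word, SINK]
        edges.update(zip(path, path[1:]))
    return vertices, edges

V, E = two_t_inf(["abc", "abdd"])
print(sorted(E))
```

On the sample {abc, abdd} this produces the seven edges of the SOA for a . b . (c + d+), in line with Figure 2(a). The empty word contributes the single edge (src, sink).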
|
|||
|
|
It turns out that 2T-INF is also complete for building a SOA representation of
|
|||
|
|
a target SORE r, provided that its input sample is representative with regard
|
|||
|
|
to r.
|
|||
|
|
Definition 8 (Representative Sample). A word v of length 2 is said to be a
|
|||
|
|
2-gram of a set of words W if it occurs as a subword in some w ∈ W. A sample
|
|||
|
|
S is representative of a SORE r if S ⊆ L(r) and the following statements hold:
|
|||
|
|
(1) for every a ∈ Σ starting a word in L(r) there is a word in S that starts with a;
(2) for every a ∈ Σ ending a word in L(r) there is a word in S that ends with a;
|
|||
|
|
(3) every 2-gram of L(r) is a 2-gram of S.
|
|||
|
|
If S is not representative of r, then we say that S does not cover r.
|
|||
|
|
For instance, the sample {a, b, c} is representative for a + b + c but {a, c}
|
|||
|
|
is not since it lacks a word starting with b. Furthermore, the sample
|
|||
|
|
{bacacdacde, cbacdbacde, abccaadcde} is representative for ((b?(a + c))+ d)+ e but
|
|||
|
|
{bacacdacde, cbacdbacde} is not since it does not contain the 2-gram ab.
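The three conditions of Definition 8 can be checked mechanically once the first symbols, last symbols, and 2-grams of L(r) are known. The sketch below assumes these three sets are supplied by the caller (here written out by hand for the second example), that element names are single characters, and that S ⊆ L(r) has already been established; `two_grams` and `is_representative` are hypothetical helper names.

```python
def two_grams(words):
    """All length-2 subwords occurring in some word of `words`."""
    return {w[i:i + 2] for w in words for i in range(len(w) - 1)}

def is_representative(sample, first, last, grams):
    """Check conditions (1)-(3) of Definition 8.

    `first`, `last`, and `grams` describe L(r): the symbols that can start
    a word, the symbols that can end a word, and the 2-grams of L(r).
    """
    starts = {w[0] for w in sample if w}
    ends = {w[-1] for w in sample if w}
    return first <= starts and last <= ends and grams <= two_grams(sample)

# Hand-derived description of L(((b?(a + c))+ d)+ e); an assumption of this sketch.
FIRST, LAST = {"a", "b", "c"}, {"e"}
GRAMS = {"ba", "bc", "aa", "ab", "ac", "ad", "ca", "cb", "cc", "cd",
         "da", "db", "dc", "de"}

full = ["bacacdacde", "cbacdbacde", "abccaadcde"]
print(is_representative(full, FIRST, LAST, GRAMS))
print(is_representative(full[:2], FIRST, LAST, GRAMS))
```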
|
|||
|
|
PROPOSITION 9. If S is a representative sample of SORE r then L(2T-INF(S)) = L(r).
|
|||
|
|
|
|||
|
|
PROOF. It is not hard to see that every SORE r can be transformed into an
|
|||
|
|
equivalent SOA Gr : we take as nodes of Gr all element names occurring in r
|
|||
|
|
plus the initial state src and the final state sink; for each alphabet symbol a that
starts a word in L(r) we add the edge (src, a) to Gr; for each alphabet symbol a
that ends a word in L(r) we add the edge (a, sink) to Gr; and for each alphabet
|
|||
|
|
symbol b that follows an alphabet symbol a in a word in L(r) we add the edge
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 3. A SOA not equivalent to any SORE. It accepts the same language as a(ba)+ .
|
|||
|
|
|
|||
|
|
(a, b) to Gr . Now reason as follows. Clearly, S ⊆ L(r) = L(Gr ). Hence, 2T-INF(S)
|
|||
|
|
is a subgraph of Gr by Proposition 7. Since S is a representative sample of r,
|
|||
|
|
however, every edge of Gr must also be in 2T-INF(S). As such, 2T-INF(S) = Gr and
|
|||
|
|
hence L(2T-INF(S)) = L(Gr ).
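The construction of Gr used in this proof can be sketched with the standard first/last/follow computation on the syntax tree of r. This is one possible way to obtain the edges, not necessarily how the authors implement it. The tuple encoding of SOREs and all function names are assumptions of the sketch: an expression is an element name (a string) or one of ("?", e), ("+", e), (".", e1, e2), ("|", e1, e2).

```python
SRC, SINK = "src", "sink"

def nullable(e):
    if isinstance(e, str):
        return False
    if e[0] == "?":
        return True
    if e[0] == "+":
        return nullable(e[1])
    if e[0] == ".":
        return nullable(e[1]) and nullable(e[2])
    return nullable(e[1]) or nullable(e[2])          # disjunction

def first(e):
    if isinstance(e, str):
        return {e}
    if e[0] in ("?", "+"):
        return first(e[1])
    if e[0] == "." and not nullable(e[1]):
        return first(e[1])
    return first(e[1]) | first(e[2])

def last(e):
    if isinstance(e, str):
        return {e}
    if e[0] in ("?", "+"):
        return last(e[1])
    if e[0] == "." and not nullable(e[2]):
        return last(e[2])
    return last(e[1]) | last(e[2])

def follow(e):
    """Pairs (a, b) such that b can directly follow a in a word of L(e)."""
    if isinstance(e, str):
        return set()
    if e[0] == "?":
        return follow(e[1])
    if e[0] == "+":
        return follow(e[1]) | {(a, b) for a in last(e[1]) for b in first(e[1])}
    pairs = follow(e[1]) | follow(e[2])
    if e[0] == ".":
        pairs |= {(a, b) for a in last(e[1]) for b in first(e[2])}
    return pairs

def soa_edges(e):
    """Edge set of the SOA G_r for the SORE e."""
    edges = {(SRC, a) for a in first(e)} | {(a, SINK) for a in last(e)} | follow(e)
    return edges | ({(SRC, SINK)} if nullable(e) else set())

def two_t_inf_edges(sample):
    """Edge set produced by 2T-INF, for comparison."""
    return {p for w in sample for p in zip([SRC, *w], [*w, SINK])}

# r = ((b?(a + c))+ d)+ e
R = (".", ("+", (".", ("+", (".", ("?", "b"), ("|", "a", "c"))), "d")), "e")
SAMPLE = ["bacacdacde", "cbacdbacde", "abccaadcde"]
print(soa_edges(R) == two_t_inf_edges(SAMPLE))
```

For the representative sample of Figure 2(b) the two edge sets coincide, as Proposition 9 predicts; dropping the third word leaves a strict subgraph.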
|
|||
|
|
3.2 From SOA to SORE
|
|||
|
|
Proposition 9 shows that it is possible to learn a SOA representation of a target
|
|||
|
|
SORE r, provided that we are given enough data. To transform this SOA into
|
|||
|
|
a regular expression, an obvious approach would be to use known techniques
|
|||
|
|
such as the classical state elimination algorithm (refer to, e.g., Hopcroft and
|
|||
|
|
Ullman [1979]). Unfortunately, as already hinted at by Fernau [2004, 2009]
|
|||
|
|
and as we illustrate shortly, it is very difficult to get concise regular expressions
|
|||
|
|
from an automaton representation. For instance, the classical state elimination
|
|||
|
|
algorithm applied to the SOA generated by 2T-INF in Figure 2(b) yields the
|
|||
|
|
expression:1
|
|||
|
|
(aa∗ d + (c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d) + (b + aa∗ b + (c +
|
|||
|
|
aa∗ c)(c + aa∗ c)∗ (b + aa∗ b))(aa∗ b + (c + aa∗ c)(c + aa∗ c)∗
|
|||
|
|
(b + aa∗ b))∗ (aa∗ d + (c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d)))(aa∗ d +
|
|||
|
|
(c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d) + (b + aa∗ b + (c + aa∗ c)(c +
|
|||
|
|
aa∗ c)∗ (b + aa∗ b))(aa∗ b + (c + aa∗ c)(c + aa∗ c)∗ (b + aa∗ b))∗
|
|||
|
|
|
|||
|
|
which differs quite a bit from the equivalent SORE
|
|||
|
|
((b?(a + c))+ d)+ e        (‡).
|
|||
|
|
|
|||
|
|
Actually, results by Ehrenfeucht and Zeiger [1976], Gelade and Neven [2008],
|
|||
|
|
and Gruber and Holzer [2008] show that it is impossible in general to generate
|
|||
|
|
concise regular expressions from automata: there are automata, even SOAs as
|
|||
|
|
generated by 2T-INF, for which the number of occurrences of alphabet symbols in
|
|||
|
|
the smallest equivalent expression is exponential in the size of the automaton.
|
|||
|
|
For such automata, a concise regular expression representation hence does not
|
|||
|
|
exist.
|
|||
|
|
These results imply that there are SOAs G for which an equivalent SORE
|
|||
|
|
does not exist (Figure 3 gives a simple example). Note, however, that when
|
|||
|
|
such a SORE r does exist, its size is always linearly bounded by the number of
|
|||
|
|
states of G. Indeed, since every alphabet symbol can occur at most once in r, the
|
|||
|
|
size of r is linearly bounded by the alphabet symbols that it mentions. Since G
|
|||
|
|
and r are equivalent, these symbols are exactly the states of G (minus src and
|
|||
|
|
sink). Hence, the SOREs constitute a well-behaved and concisely representable
|
|||
|
|
subset of the regular languages. It is therefore natural to investigate how to
|
|||
|
|
1 Transformation computed by JFLAP: www.jflap.org.
|
|||
|
|
|
|||
|
|
|
|||
|
|
|
|||
|
|
transform a given SOA into an equivalent SORE when such a SORE exists.
|
|||
|
|
Clearly, the previous example illustrates that the classical state elimination
|
|||
|
|
algorithm does not suffice for this purpose.
|
|||
|
|
For that reason, we introduce in this section a novel graph-rewriting approach for transforming SOAs into SOREs. While our approach is related to the
|
|||
|
|
classical state-elimination algorithm for transforming an arbitrary automaton
|
|||
|
|
into a regular expression, we do not eliminate states by introducing additional
|
|||
|
|
edges (thereby duplicating subexpressions) but instead replace sets of states
|
|||
|
|
by single states (taking care to avoid duplication). In addition, there are two
|
|||
|
|
rewriting steps that only remove edges.
|
|||
|
|
Just as in the classical algorithm, it is necessary for the definition of the graph
rewrite rules to define a generalization of SOAs in which internal states are
allowed to be labeled by SOREs (as opposed to element names from Σ). This generalization is defined as follows. Call two regular expressions r and s alphabet-disjoint if r and s have no alphabet symbol in common. For example, (a+b)? and
|
|||
|
|
c+ are alphabet-disjoint, whereas (a + b) and b?c+ are not. Call an expression
|
|||
|
|
r proper if it accepts at least one nonempty word (i.e., it is not equivalent to ∅,
|
|||
|
|
nor to ε).
|
|||
|
|
Definition 10. A generalized Single Occurrence Automaton (generalized
|
|||
|
|
SOA for short) is a finite graph G = (V, E) such that:
|
|||
|
|
(1) {src, sink} ⊆ V and all vertices in V − {src, sink} are pairwise alphabet-disjoint proper SOREs; and
|
|||
|
|
(2) the edge relation E is such that src has only outgoing edges; sink has only
|
|||
|
|
incoming edges; and every v ∈ V is visited by a walk from src to sink.
|
|||
|
|
A word w ∈ Σ∗ is accepted by G if there is a walk src r1 . . . rm sink in G and a
|
|||
|
|
division of w into subwords w = w1 . . . wm such that wi ∈ L(ri ), for 1 ≤ i ≤ m.
|
|||
|
|
Again, we write L(G) for the set of all words accepted by G.
|
|||
|
|
Figure 7 shows some examples. Clearly, every SOA is also a generalized
|
|||
|
|
SOA. In what follows, we write PredG (s) for the set of all direct predecessors of
|
|||
|
|
a SORE s in G, and SuccG (s) for the set of all direct successors of s in G.
|
|||
|
|
PredG (s) := {r | (r, s) ∈ E(G)},
|
|||
|
|
SuccG (s) := {t | (s, t) ∈ E(G)}.
|
|||
|
|
Furthermore, we write $\mathrm{Pred}^-_G(s)$ for $\mathrm{Pred}_G(s) - \{s\}$ and similarly $\mathrm{Succ}^-_G(s)$ for $\mathrm{Succ}_G(s) - \{s\}$. Finally, we write

$$\mathrm{Pred}^+_G(s) := \begin{cases} \mathrm{Pred}_G(s) \cup \{s\} & \text{if } s = (s')^{+} \text{ for some } s' \\ \mathrm{Pred}_G(s) & \text{otherwise} \end{cases}$$

$$\mathrm{Succ}^+_G(s) := \begin{cases} \mathrm{Succ}_G(s) \cup \{s\} & \text{if } s = (s')^{+} \text{ for some } s' \\ \mathrm{Succ}_G(s) & \text{otherwise.} \end{cases}$$
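The predecessor and successor sets and their minus and plus variants are easy to compute on an edge-set representation. The sketch below assumes that states are either element names (strings) or tuples such as ("+", e) for an iteration, so that the test "s = s'+ for some s'" becomes a check on the top-level operator; this encoding and the function names are illustrative only.

```python
def pred(edges, s):
    return {r for (r, x) in edges if x == s}

def succ(edges, s):
    return {t for (x, t) in edges if x == s}

def is_iteration(s):
    """True if the state s is of the form s'+ at the top level."""
    return isinstance(s, tuple) and s[0] == "+"

def pred_minus(edges, s):
    return pred(edges, s) - {s}

def succ_minus(edges, s):
    return succ(edges, s) - {s}

def pred_plus(edges, s):
    return (pred(edges, s) | {s}) if is_iteration(s) else pred(edges, s)

def succ_plus(edges, s):
    return (succ(edges, s) | {s}) if is_iteration(s) else succ(edges, s)

a_plus = ("+", "a")
E = {("src", a_plus), (a_plus, "b"), ("b", "b"), ("b", "sink")}
print(pred_plus(E, a_plus), succ_plus(E, "b"))
```

The plus variants treat an iteration state as if it carried a self-loop, which is what the gray loops in the rule illustrations express.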
|
|||
|
|
|
|||
|
|
Rewrite rules. Our system of rewrite rules consists of the seven rules shown
|
|||
|
|
in Figures 4–6: one rule to introduce disjunction (r + s), four rules to introduce
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 4. Rewrite rules part 1. In the illustrations, P is the set $\mathrm{Pred}_G(r) - \{r, s\}$. S is the set $\mathrm{Succ}_G(s) - \{r, s\}$. The gray loops on r and s indicate that $r \in \mathrm{Succ}^+_G(r)$ and $s \in \mathrm{Succ}^+_G(s)$, respectively.
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 5. Rewrite rules part 2. In the illustrations, P is the set $\mathrm{Pred}_G(r) - \{r, s\}$. S is the set $\mathrm{Succ}_G(s) - \{r, s\}$. The gray loops on r and s indicate that $r \in \mathrm{Succ}^+_G(r)$ and $s \in \mathrm{Succ}^+_G(s)$, respectively.
|
|||
|
|
|
|||
|
|
concatenation (r . s, r? . s, r . s?, and r? . s?), one rule to introduce iteration (r + ),
|
|||
|
|
and one rule to introduce optionals (r?). At the basis of the first five rules lies
|
|||
|
|
the contraction of two states r and s into a single new state t, which is defined
|
|||
|
|
as follows.
|
|||
|
|
Definition 11 (State Contraction). Let G be a generalized SOA; let r and s
|
|||
|
|
be states in G; and let t be a state not in G. The contraction of r and s into t is
|
|||
|
|
the generalized SOA G[r, s ⇒ t] obtained from G as follows:
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 6. Rewrite rules part 3. In the illustrations, P is the set PredG (r) − {r, s}. S is the set SuccG (s) −
|
|||
|
|
{r, s}. Note in particular that the rule OPTIONAL r? can only be applied when G contains only one
|
|||
|
|
node besides src and sink.
|
|||
|
|
|
|||
|
|
(1) Add t as a new state to G;
|
|||
|
|
(2) make every v ∈ PredG (r) − {r, s} a predecessor of t;
|
|||
|
|
(3) make every w ∈ SuccG (s) − {r, s} a successor of t;
|
|||
|
|
(4) add a loop t → t if r ∈ SuccG (s); and
|
|||
|
|
(5) remove r, s and all of their incoming and outgoing edges.
|
|||
|
|
Note that state contraction is not symmetric.
|
|||
|
|
To illustrate, the contraction G[a, c ⇒ a + c] of the generalized SOA G in
|
|||
|
|
Figure 7(a) is shown in Figure 7(b). Similarly, the contraction G[b, a + c ⇒
|
|||
|
|
b? .(a + c)] of the generalized SOA G in Figure 7(b) is shown in Figure 7(c). Note
|
|||
|
|
that if r = s, then G[r, s ⇒ t] is simply a substitution of r by the new state t.
|
|||
|
|
To simplify notation, we simply write G[r ⇒ t] for such contractions in what
|
|||
|
|
follows.
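State contraction can be sketched on the edge-set representation as follows. The sketch takes the predecessors of t from r and the successors of t from s, matching the sets P and S of the rule illustrations; that reading of step (3), the string labels for new states, and the function names are assumptions made here for illustration.

```python
def soa_edges(sample):
    """Edges of the SOA that 2T-INF builds for `sample`."""
    return {p for w in sample for p in zip(["src", *w], [*w, "sink"])}

def contract(edges, r, s, t):
    """Return the edge set of G[r, s => t] following Definition 11."""
    preds = {v for (v, w) in edges if w == r} - {r, s}
    succs = {w for (v, w) in edges if v == s} - {r, s}
    # Step (5): drop r, s and every edge touching them.
    kept = {(v, w) for (v, w) in edges if not ({v, w} & {r, s})}
    # Steps (1)-(3): attach the new state t.
    new = {(v, t) for v in preds} | {(t, w) for w in succs}
    # Step (4): t gets a loop when s has an edge back to r.
    if (s, r) in edges:
        new.add((t, t))
    return kept | new

G = soa_edges(["bacacdacde", "cbacdbacde", "abccaadcde"])   # Figure 2(b)
G1 = contract(G, "a", "c", "a+c")
G2 = contract(G1, "b", "a+c", "b?.(a+c)")
print(sorted(G2))
```

Starting from the SOA of Figure 2(b), the two calls mirror the contractions G[a, c ⇒ a + c] and G[b, a + c ⇒ b? . (a + c)] discussed above.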
|
|||
|
|
In addition to contraction, the rewrite rules also use the following
|
|||
|
|
operation.
|
|||
|
|
Definition 12. If G is a generalized SOA and r, s are states in G, then we
|
|||
|
|
write G \ (r, s) to denote the generalized SOA obtained from G by removing the
|
|||
|
|
edge from r to s, if present.
|
|||
|
|
In what follows, we write G → H to indicate that G rewrites to H in a single
step according to the rewrite rules in Figures 4–6, and G →∗ H to indicate that
|
|||
|
|
G rewrites to H in zero or more steps.
|
|||
|
|
The following proposition shows that the rewrite rules are sound.
|
|||
|
|
PROPOSITION 13. If G is a generalized SOA and G → H then H is also a
|
|||
|
|
generalized SOA and L(G) = L(H).
|
|||
|
|
|
|||
|
|
|
|||
|
|
PROOF. First observe that, since all states in a generalized SOA are pairwise
|
|||
|
|
alphabet-disjoint proper SOREs, the new states r + s; r . s; r? . s; r . s?; r? . s?; r + ;
|
|||
|
|
and r? introduced by the rewrite rules in Figures 4–6 must themselves be proper
|
|||
|
|
SOREs alphabet-disjoint with the remaining states. As such, all states in H
|
|||
|
|
are pairwise alphabet-disjoint proper SOREs. To show that H is a generalized
|
|||
|
|
SOA, it hence remains to show that every state in H participates in a walk
|
|||
|
|
from src to sink. Hereto, we distinguish the following three cases.
|
|||
|
|
—H = G[r, s ⇒ t] for some t. Then, since G is a generalized SOA, r and s both
participate in a walk from src to sink. In particular, there is a walk from src
|
|||
|
|
to r in G, and a walk from s to sink. Then, by definition of state contraction,
|
|||
|
|
there is a walk from src to t and from t to sink in H, that is, t participates in
|
|||
|
|
a walk from src to sink in H.
|
|||
|
|
—H = G[r ⇒ r+] \ (r+, r+). Then, by definition of state contraction and since
|
|||
|
|
r participates in a walk from src to sink in G, r + must participate in a
|
|||
|
|
walk from src to sink in G[r ⇒ r + ]. This walk can always be transformed
|
|||
|
|
into a walk from src to sink in H by removing the edge (r + , r + ) should it
|
|||
|
|
occur.
|
|||
|
|
—H = G[r ⇒ r?] \ (src, sink). Then, by definition of state contraction and since
|
|||
|
|
r participates in a walk from src to sink in G, r? must participate in a walk
|
|||
|
|
from src to sink in G[r ⇒ r?]. Since the edge (src, sink) cannot occur in this
|
|||
|
|
walk (recall that src has no incoming edges and sink has no outgoing edges),
|
|||
|
|
r? also participates in a walk from src to sink in H.
|
|||
|
|
To see that L(G) = L(H) we reason by a case analysis on the rewrite rule used
|
|||
|
|
to transform G into H. For economy of space, we only illustrate this reasoning
|
|||
|
|
for DISJUNCTION r + s; the other cases are similar.
|
|||
|
|
So, suppose that G was rewritten into H by DISJUNCTION r + s, that is, H =
|
|||
|
|
G[r, s ⇒ r+s]. Then r and s have the same (extended) predecessor and successor
|
|||
|
|
set. From this, it follows that the following statements are equivalent.
|
|||
|
|
(1) s ∈ SuccG (r);
(2) r ∈ SuccG (s);
(3) $s \in \mathrm{Succ}^+_G(s)$;
(4) $r \in \mathrm{Succ}^+_G(r)$.

For instance, s ∈ SuccG (r) ⇔ r ∈ SuccG (s) since:

\begin{align*}
s \in \mathrm{Succ}_G(r) &\Leftrightarrow s \in \mathrm{Succ}_G(r) \cup \{r\} && \text{since } r \neq s \\
&\Leftrightarrow s \in \mathrm{Succ}^+_G(r) && \text{by definition of } \mathrm{Succ}^+_G(r) \\
&\Leftrightarrow s \in \mathrm{Succ}^+_G(s) && \text{since } \mathrm{Succ}^+_G(r) = \mathrm{Succ}^+_G(s) \\
&\Leftrightarrow s \in \mathrm{Pred}^+_G(s) && \text{by definition of } \mathrm{Succ}^+_G(s) \text{ and } \mathrm{Pred}^+_G(s) \\
&\Leftrightarrow s \in \mathrm{Pred}^+_G(r) && \text{since } \mathrm{Pred}^+_G(r) = \mathrm{Pred}^+_G(s) \\
&\Leftrightarrow s \in \mathrm{Pred}_G(r) \cup \{r\} && \text{by definition of } \mathrm{Pred}^+_G(r) \\
&\Leftrightarrow s \in \mathrm{Pred}_G(r) && \text{since } r \neq s \\
&\Leftrightarrow r \in \mathrm{Succ}_G(s) && \text{by definition of } \mathrm{Pred}_G(r) \text{ and } \mathrm{Succ}_G(s)
\end{align*}
|
|||
|
|
|
|||
|
|
|
|||
|
|
|
|||
|
|
The other equivalences can be similarly obtained. From these equivalences,
|
|||
|
|
it follows that G must take one of the two forms illustrated for rewrite rule
|
|||
|
|
DISJUNCTION r + s in Figure 4. In both cases, the corresponding H is also shown.
|
|||
|
|
Now suppose that w = w1 . . . wm ∈ Σ∗ is recognized by the walk src, t1, . . . , tm, sink in G with wi ∈ L(ti) for 1 ≤ i ≤ m. Let the sequence src, t′1, . . . , t′m, sink be obtained from src, t1, . . . , tm, sink by replacing every occurrence of r and s by r + s. By inspection of the illustrations for rule DISJUNCTION r + s in Figure 4 it is not difficult to see that src, t′1, . . . , t′m, sink is a walk in H. Moreover, wi ∈ L(t′i) by construction for 1 ≤ i ≤ m. Therefore, w ∈ L(H) and hence L(G) ⊆ L(H).

Conversely, suppose that w = w1 . . . wm ∈ Σ∗ is recognized by src, t1, . . . , tm, sink in H with wi ∈ L(ti) for 1 ≤ i ≤ m. Determine t′i as follows:

$$t'_i = \begin{cases} t_i & \text{if } t_i \neq r + s \\ r & \text{if } t_i = r + s \text{ and } w_i \in L(r) \\ s & \text{if } t_i = r + s \text{ and } w_i \in L(s) \end{cases}$$

By inspection of the illustrations for rule DISJUNCTION r + s in Figure 4 it is not difficult to see that src, t′1, . . . , t′m, sink is a walk in G. Moreover, wi ∈ L(t′i) for 1 ≤ i ≤ m. Therefore w ∈ L(G) and hence L(H) ⊆ L(G). As such, L(G) = L(H).
|
|||
|
|
Since each rewrite rule either contracts two states into a single state or
|
|||
|
|
removes an edge from G, the size of H is always smaller than G. Therefore, we
|
|||
|
|
have the next proposition.
|
|||
|
|
PROPOSITION 14. The system of rewrite rules in Figures 4–6 is terminating:
|
|||
|
|
there is no infinite sequence of rewrite steps G → H → I → . . .
|
|||
|
|
Our algorithm REWRITE, shown in Algorithm 2, then operates as follows. First,
|
|||
|
|
it checks whether the input SOA G corresponds to the empty language (∅) or
|
|||
|
|
the empty word (ε) in lines 1–5. If so, it returns the corresponding regular
|
|||
|
|
expression. Otherwise, it rewrites G until no further rules apply. It then checks
|
|||
|
|
whether the resulting generalized SOA is final.
|
|||
|
|
Definition 15. A generalized SOA G is final if E(G) = {(src, r), (r, sink)}
|
|||
|
|
with r distinct from src and sink. In other words, G is final if it is a chain
|
|||
|
|
consisting of the source, an arbitrary regular expression, and the sink.
|
|||
|
|
If the resulting generalized SOA is indeed final, then clearly L(G) = L(r),
|
|||
|
|
and r is returned as result. If the resulting generalized SOA is not final, then
|
|||
|
|
G is not equivalent to a SORE (as we formally show further on), and REWRITE
|
|||
|
|
fails. To illustrate, Figure 7 shows an example run of REWRITE on the example
|
|||
|
|
SOA from Figure 2(b).
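The control flow of REWRITE can be sketched as a small driver loop. Because the preconditions of most rewrite rules are given only in Figures 4–6, the sketch takes the rules as a parameter: each rule is a hypothetical function that returns a rewritten edge set or None when it does not apply. Only a simplified ITERATION rule is included as an example, based on the description that it replaces a state r carrying a loop by r+ and removes that loop. States are plain string labels here, which is an assumption of the sketch.

```python
SRC, SINK = "src", "sink"

def reachable(edges, start, goal):
    """Depth-first search over the edge set."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        if v == goal:
            return True
        for (x, y) in edges:
            if x == v and y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def iteration_rule(edges):
    """Simplified ITERATION: rename a looping state r to (r)+ and drop the loop."""
    for (r, s) in sorted(edges):
        if r == s:
            t = f"({r})+"
            renamed = {(t if x == r else x, t if y == r else y) for (x, y) in edges}
            return renamed - {(t, t)}
    return None

def rewrite(edges, rules):
    if not reachable(edges, SRC, SINK):
        return "∅"
    if edges == {(SRC, SINK)}:
        return "ε"
    progress = True
    while progress:                      # apply rules until none applies
        progress = False
        for rule in rules:
            result = rule(edges)
            if result is not None:
                edges, progress = result, True
                break
    inner = {v for e in edges for v in e} - {SRC, SINK}
    if len(inner) == 1:
        (r,) = inner
        if edges == {(SRC, r), (r, SINK)}:   # the automaton is final
            return r
    raise ValueError("rewriting got stuck in a nonfinal automaton")

print(rewrite({(SRC, "a"), ("a", "a"), ("a", SINK)}, [iteration_rule]))
```

With only this one rule the driver is far from complete; the point is the shape of the loop and the final check, not the rule set.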
|
|||
|
|
THEOREM 16. On input SOA G, REWRITE fails if and only if G is not equivalent
to a SORE. Otherwise, REWRITE returns a SORE equivalent to G. Moreover,
REWRITE operates in time O(n^5) where n is the number of states in G.
|
|||
|
|
Note that the complexity O(n^5) is reasonable since when we apply REWRITE to
|
|||
|
|
the result of 2T-INF on a sample S, n corresponds to the (typically small) number
|
|||
|
|
|
|||
|
|
|
|||
|
|
Algorithm 2. REWRITE
Input: a SOA G
Output: a SORE r such that L(r) = L(G)
1: if sink is not reachable from src in G then
2:     return ∅
3: else if E(G) = {(src, sink)} then
4:     return ε
5: else
6:     while a rewrite rule from Figures 4–6 can be applied do
7:         perform the rewrite rule on G
8:     end while
9:     if G is final then
10:         return the corresponding regular expression
11:     else
12:         fail
13:     end if
14: end if
|
|||
|
|
|
|||
|
|
of distinct element names occurring in S, not the total number or total length
|
|||
|
|
of words in S.
|
|||
|
|
The remainder of this section is devoted to the proof of Theorem 16, which
|
|||
|
|
is divided into three steps. First, we show that REWRITE is sound.
|
|||
|
|
PROPOSITION 17. If REWRITE(G) does not fail then it returns a SORE equivalent to G, for any SOA G.
|
|||
|
|
PROOF. We distinguish three cases.
|
|||
|
|
|
|||
|
|
(1) If sink is not reachable from src then REWRITE(G) = ∅ (clearly a SORE) and
|
|||
|
|
L(G) = ∅ = L(∅), as desired.
|
|||
|
|
(2) If E(G) = {(src, sink)} then REWRITE(G) = ε (again clearly a SORE), and
|
|||
|
|
L(G) = {ε} = L(ε), as desired.
|
|||
|
|
(3) Otherwise, G is rewritten into a final generalized SOA H with E(H) =
|
|||
|
|
{(src, t), (t, sink)} (t distinct from src and sink) and REWRITE(G) = t. In
|
|||
|
|
particular, t is a SORE. By Proposition 13, L(G) = L(H) and thus, since
|
|||
|
|
E(H) = {(src, t), (t, sink)}, L(G) = L(H) = L(t) = L(REWRITE(G)), as desired.
|
|||
|
|
Next, we show that REWRITE has the claimed complexity.
|
|||
|
|
PROPOSITION 18. REWRITE operates in time O(n^5), where n is the number of
|
|||
|
|
states of its input G.
|
|||
|
|
PROOF. We assume that checking whether there is an edge from state r
|
|||
|
|
to state s can be done in constant time (for instance, using an adjacency matrix representation). To see that REWRITE runs in time O(n^5) under this assumption, let us check that lines 1–4, lines 6–7, and lines 8–11 all run in O(n^5).
|
|||
|
|
(Lines 1–4). Since G has at most n^2 edges, checking whether sink is reachable
from src can be done in time O(n^2) using depth-first search. Moreover, checking
whether E(G) = {(src, sink)} can also be done in time O(n^2).
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 7. An execution of REWRITE on the example automaton in Figure 2(b). Step (1) applies DISJUNCTION r + s with r = a and s = c. Step (2) applies CONCATENATION r? . s with r = b and s = a + c. Step (3) applies ITERATION r+ with r = b? . (a + c). Step (4) applies CONCATENATION r . s with r = (b? . (a + c))+ and s = d. Step (5) applies ITERATION r+ with r = (b? . (a + c))+ . d. One more application of CONCATENATION r . s with r = ((b? . (a + c))+ . d)+ and s = e (not shown) leads to the resulting expression ((b? . (a + c))+ . d)+ . e.
|
|||
|
|
|
|||
|
|
(Lines 6–7). Suppose that G1, G2, . . . , Gk is the sequence of generalized SOAs produced by lines 6–7 when rewriting G = G1 until no further rewrite rule applies. Since rewrite rules never introduce new states without also removing a state, every Gi has at most n states. Now reason as follows.

—The rule for optionals can be applied at most once in this sequence, since the automaton that it returns is always final, and since no rewrite rule applies to a final generalized SOA. Checking the preconditions of the rule for optionals can be done in time O(n^2), and its action can be performed in time O(n). As such, the total time spent in this sequence on applying the rewrite rule for optionals is bounded by O(n^2).

—Since the rewrite rules for disjunction and concatenation contract two states into a single one, these rewrite rules can be applied at most n times in this sequence. Since all of their preconditions can be checked in time O(n^4) (by iterating over all pairs of states r and s in the current automaton Gi and comparing Pred(r), Pred(s), Succ(r), and Succ(s) as desired) and since state contraction can be done in time O(n), the total time spent in this sequence on the rewrite rules for disjunction and concatenation is bounded by O(n × n^4) = O(n^5).

—Since the rule for iteration removes the loop of the state to which it is applied, and since each generalized SOA contains at most n loops, there can be at most n consecutive applications of this rule before another rewrite rule is applied. By the preceding remarks, there are at most n applications of the other rewrite rules, so the rewrite rule for iteration can be applied at most n^2 times in this sequence. Since its precondition can be checked in constant time, and since its action can be done in time O(n), the total time spent in this sequence on the rewrite rule for iteration is bounded by O(n^2 × n) = O(n^3).
|
|||
|
|
(Lines 8–11). Finally, checking whether a generalized SOA is final and extracting the corresponding regular expression can be done in time O(n^2).
In summary, lines 1–4 run in time O(n^2), lines 6–7 run in time O(n^5), and
lines 8–11 run in time O(n^2), yielding a total running time of O(n^5).
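For reference, the bounds collected in this proof add up as follows. This is only a restatement of the figures given above, with the three terms in brackets corresponding to the rules for optionals, for disjunction and concatenation, and for iteration, respectively.

```latex
\[
\underbrace{O(n^2)}_{\text{lines 1--4}}
\;+\;
\underbrace{\bigl[\, O(n^2) + O(n \cdot n^4) + O(n^2 \cdot n) \,\bigr]}_{\text{lines 6--7}}
\;+\;
\underbrace{O(n^2)}_{\text{lines 8--11}}
\;=\; O(n^5).
\]
```

The dominant term comes from the disjunction and concatenation rules: at most n applications, each preceded by an O(n^4) precondition check.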
|
|||
|
|
|
|||
|
|
|
|||
|
|
Finally, we show that REWRITE(G) fails if and only if G is not equivalent
|
|||
|
|
to a SORE, or equivalently, that REWRITE(G) does not fail if, and only if, G is
|
|||
|
|
equivalent to a SORE. This is actually the most involved part of the proof of
|
|||
|
|
Theorem 16. Proposition 17 already shows that if REWRITE(G) does not fail, then
|
|||
|
|
G is equivalent to a SORE. Hence, it remains to show the next proposition.
|
|||
|
|
PROPOSITION 19. If SOA G is equivalent to a SORE, then REWRITE(G) does not fail.
|
|||
|
|
|
|||
|
|
Essentially, we prove this proposition in two steps. Call a generalized SOA
|
|||
|
|
proper if L(G) ≠ ∅ and L(G) ≠ {ε}.
|
|||
|
|
(1) We first show that for any proper SOA G equivalent to a SORE there exists
|
|||
|
|
a sequence of rewrite steps that ends in a final automaton (Corollary 46).
|
|||
|
|
(2) In addition, we show that if proper G can be rewritten into a final automaton
|
|||
|
|
by a particular sequence of rewrite steps, then any sequence of rewrite steps
|
|||
|
|
on G ends in a final automaton (Corollary 54).
|
|||
|
|
As such, REWRITE(G) cannot fail when G is equivalent to a SORE: either G is
|
|||
|
|
not proper, in which case lines 1–4 of Algorithm 2 return a valid expression, or
|
|||
|
|
G is proper and will hence be rewritten into a final automaton, in which case
|
|||
|
|
line 9 returns a valid expression. The details may be found in Appendix A.
|
|||
|
|
3.3 Discussion
|
|||
|
|
It should be noted that while the result of REWRITE is always a SORE, this
|
|||
|
|
SORE need not be easy to read (depending on the order of rewriting). For
|
|||
|
|
instance, it is possible for REWRITE to generate an expression r .(s? . t?)?. Clearly,
|
|||
|
|
the optional around (s? . t?) is redundant. Removing it leads to the simpler
|
|||
|
|
r .(s? . t?). For presentation to the user, it is therefore advisable to postprocess
|
|||
|
|
the result of REWRITE (and its variations in Section 4) using a regular expression
|
|||
|
|
simplification algorithm.
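As a minimal sketch of such a postprocessing step (not the simplification algorithm used in the article, which the text does not specify), the following Python fragment removes every optional operator whose operand already accepts ε. The classes Sym, Cat, Alt, Opt, Plus and the functions nullable, simplify, and show are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Cat:
    parts: Tuple["Regex", ...]


@dataclass(frozen=True)
class Alt:
    parts: Tuple["Regex", ...]


@dataclass(frozen=True)
class Opt:
    body: "Regex"


@dataclass(frozen=True)
class Plus:
    body: "Regex"


Regex = Union[Sym, Cat, Alt, Opt, Plus]


def nullable(r: Regex) -> bool:
    """True if the empty word belongs to L(r)."""
    if isinstance(r, Sym):
        return False
    if isinstance(r, Opt):
        return True
    if isinstance(r, Plus):
        return nullable(r.body)
    if isinstance(r, Cat):
        return all(nullable(p) for p in r.parts)
    return any(nullable(p) for p in r.parts)  # Alt


def simplify(r: Regex) -> Regex:
    """Drop every '?' whose operand is already nullable (bottom-up)."""
    if isinstance(r, Sym):
        return r
    if isinstance(r, Cat):
        return Cat(tuple(simplify(p) for p in r.parts))
    if isinstance(r, Alt):
        return Alt(tuple(simplify(p) for p in r.parts))
    if isinstance(r, Plus):
        return Plus(simplify(r.body))
    body = simplify(r.body)  # r is an Opt
    return body if nullable(body) else Opt(body)


def show(r: Regex) -> str:
    """Render an expression in the notation of the text."""
    if isinstance(r, Sym):
        return r.name
    if isinstance(r, Cat):
        return "(" + " . ".join(show(p) for p in r.parts) + ")"
    if isinstance(r, Alt):
        return "(" + " + ".join(show(p) for p in r.parts) + ")"
    return show(r.body) + ("?" if isinstance(r, Opt) else "+")


redundant = Cat((Sym("r"), Opt(Cat((Opt(Sym("s")), Opt(Sym("t")))))))
print(show(redundant))            # (r . (s? . t?)?)
print(show(simplify(redundant)))  # (r . (s? . t?))
```

The rule is language-preserving because L(e?) = L(e) whenever ε ∈ L(e); a full simplifier would apply further identities, which this sketch does not attempt.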
|
|||
|
|
4. DEALING WITH MISSING DATA
|
|||
|
|
The results of Section 3 suggest the following method to infer a SORE from a
|
|||
|
|
given sample S.
|
|||
|
|
(1) First, use 2T-INF to learn from S an automaton representation G of the
|
|||
|
|
target SORE r.
|
|||
|
|
(2) Next, convert G into a SORE using REWRITE.
|
|||
|
|
If S is a representative sample of r then G is equivalent to r by Proposition 9.
|
|||
|
|
Therefore, REWRITE(G) does not fail by Theorem 16, and hence REWRITE(G) is
|
|||
|
|
equivalent to r.
|
|||
|
|
Unfortunately, real-world samples are rarely representative. For instance,
|
|||
|
|
for target r = (a1 + · · · + an)+ and increasing values of n, it is increasingly unlikely
that a sample bears witness to each of the n^2 2-grams needed to represent r.
|
|||
|
|
On such nonrepresentative samples, 2T-INF will construct an automaton for
|
|||
|
|
which L(G) is a strict subset of L(r). In particular, this automaton need not be
|
|||
|
|
equivalent to a SORE, and REWRITE(G) can fail. Figure 8 shows an example.
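To make the role of 2-grams concrete, here is a small Python sketch of the automaton construction, under the assumption that 2T-INF adds an edge from src to every first symbol, from every last symbol to sink, and one edge per 2-gram of the sample. The function names two_t_inf and missing_two_grams, and the encoding of a SOA as a set of edges over single-character symbols, are choices made for this illustration only.

```python
from itertools import product


def two_t_inf(sample):
    """Edge set of the SOA for `sample`: src -> first symbol,
    last symbol -> sink, and one edge per 2-gram.
    'src' and 'sink' are reserved endpoint names in this sketch."""
    edges = set()
    for word in sample:
        path = ["src", *word, "sink"]
        edges.update(zip(path, path[1:]))
    return edges


def missing_two_grams(edges, alphabet):
    """2-grams over `alphabet` that the sample does not witness."""
    return {(a, b) for a, b in product(alphabet, repeat=2)
            if (a, b) not in edges}


# The nonrepresentative sample of Figure 8.
fig8 = two_t_inf(["bacacdacde", "abccaadcde"])
print(sorted(e for e in fig8 if e[0] == e[1]))  # [('a', 'a'), ('c', 'c')]

# Target (a + b + c)+ needs all 3^2 = 9 2-grams; the word "abc" witnesses two.
print(len(missing_two_grams(two_t_inf(["abc"]), "abc")))  # 7
```

The two self-loops found for the Figure 8 sample are the ones that make ITERATION a+ and ITERATION c+ applicable.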
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 8. The SOA generated by 2T-INF for the nonrepresentative sample S = {bacacdacde,
|
|||
|
|
abccaadcde}. The only rewrite rules that can be applied are ITERATION a+ and ITERATION c+ , after which REWRITE gets stuck in a nonfinal automaton and fails.
|
|||
|
|
|
|||
|
|
Fig. 9. Repair rules.
|
|||
|
|
|
|||
|
|
For that reason, we present in this section two modifications of REWRITE
|
|||
|
|
that “repair” G when rewriting gets stuck in a nonfinal automaton. The first
|
|||
|
|
modification, RWR, picks a single repair when rewriting gets stuck, independent
|
|||
|
|
of how the repair affects G. The second modification, RWR2 , in contrast, considers
|
|||
|
|
multiple repair strategies and selects the one that extends G in a minimal way.
|
|||
|
|
Algorithm 3. RWR
Input: a SOA G
Output: a SORE r such that L(G) ⊆ L(r) if G is not equivalent to a SORE, and L(G) = L(r) otherwise.
 1: if sink is not reachable from src in G then
 2:     return ∅
 3: else if E(G) = {(src, sink)} then
 4:     return ε
 5: else
 6:     while G is not final do
 7:         if a rewrite rule from Figures 4–6 can be applied then
 8:             apply the rewrite rule on G
 9:         else
10:             apply a repair rule from Figure 9
11:         end if
12:     end while
13:     return the corresponding regular expression r
14: end if

The repair rules used by both algorithms are shown in Figure 9. After a repair rule is applied, the automaton necessarily satisfies the precondition of the corresponding rewrite rule. Now note the following.
|
|||
|
|
PROPOSITION 20. Let G be a proper generalized SOA. If G is not final and no
|
|||
|
|
rewrite rule applies to G, then at least one of the repair rules in Figure 9 applies
|
|||
|
|
to G.
|
|||
|
|
PROOF. Since G is proper, it recognizes at least one nonempty word. Clearly,
|
|||
|
|
this can only happen when src has a successor r distinct from sink. We distinguish two cases.
|
|||
|
|
—Either r has a successor s distinct from src, sink, and r. Clearly, REPAIR r? . s?
|
|||
|
|
is then applicable to G.
|
|||
|
|
—If r does not have such a successor s, then we claim that src has another
|
|||
|
|
successor t, distinct from src, sink, and r. Indeed, suppose for the purpose
|
|||
|
|
of contradiction that no such successor exists. Then, since every state in G
|
|||
|
|
participates in a walk from src to sink, either E(G) = {(src, r), (r, sink)}, or
|
|||
|
|
E(G) = {(src, r), (r, r), (r, sink)}. In the first case G is final, in the second we
|
|||
|
|
can rewrite G using ITERATION r + —a contradiction in both cases. As such,
|
|||
|
|
the claimed t exists. Then, since src ∈ PredG (r) ∩ PredG (t), REPAIR r + t is
|
|||
|
|
applicable to G.
|
|||
|
|
As such, we can always apply a repair rule if rewriting gets stuck in a
|
|||
|
|
nonfinal automaton, after which rewriting can continue.
|
|||
|
|
4.1 A Greedy Approach: RWR
|
|||
|
|
An outline of RWR (short for REWRITE with REPAIRS) is shown in Algorithm 3. Like
|
|||
|
|
REWRITE, it first checks whether its input G is equivalent to ∅ or ε. Otherwise,
|
|||
|
|
G is rewritten using the rewrite rules in Figures 4–6 until a final automaton is
|
|||
|
|
reached, arbitrarily selecting a repair rule when rewriting gets stuck. (In our
|
|||
|
|
implementation we prefer repairs that make small extensions to the language
|
|||
|
|
of the automaton over repairs that make larger extensions. In particular, we
|
|||
|
|
|
|||
|
|
|
|||
|
|
first check whether there are r and s for which REPAIR r . s? can be applied. Then
|
|||
|
|
we check whether there are r and s for which REPAIR r? . s can be applied. Next,
|
|||
|
|
we check for REPAIR r + s and finally for REPAIR r? . s?.)
|
|||
|
|
Since the repair rules add edges to G, thereby increasing L(G), we may
|
|||
|
|
conclude the following theorem.
|
|||
|
|
THEOREM 21. For a SOA G, RWR always produces a SORE r with L(G) ⊆
|
|||
|
|
L(r). Moreover, if G is equivalent to a SORE, then L(G) = L(r).
|
|||
|
|
(The second statement follows by Theorem 16.) Combined with Proposition 9,
|
|||
|
|
we hence obtain the next corollary.
|
|||
|
|
COROLLARY 22.
|
|||
|
|
|
|||
|
|
Let M be the composition of 2T-INF with RWR, that is, M(S) :=
|
|||
|
|
|
|||
|
|
RWR(2T-INF(S)). Then M learns the class of SOREs from positive data.
|
|||
|
|
|
|||
|
|
4.2 Exploring the Search Space: RWR2
|
|||
|
|
When rewriting gets stuck, RWR arbitrarily selects a repair rule (perhaps based
|
|||
|
|
on some ordering of the rules as in our implementation), and discards the others. It should be clear, however, that when different repair rules are applicable,
|
|||
|
|
one rule may have a smaller impact on the language of the automaton than
|
|||
|
|
another. For that reason we present in this section a different modification
|
|||
|
|
of REWRITE that, in contrast to RWR, tries the “best” repair rules when there
|
|||
|
|
are several candidates. Here, the “best” repair rules are those that add the
|
|||
|
|
least number of words to the language. Since an automaton defines an infinite
|
|||
|
|
language in general, it is of course impossible to take all added words into
|
|||
|
|
account. We therefore only consider the words up to a length n, where n is twice the number of alphabet symbols in the automaton. Formally, for a language L, let |L^{≤n}| denote the number of words in L of length at most n. Moreover, say that generalized SOA H is a repair of generalized SOA G if H is obtained by applying a repair rule on G. Then the repairs of the current automaton G are ordered according to increasing values of |L(H)^{≤n}|, and the best (i.e., first) among them are further investigated.
|
|||
|
|
Algorithm 4. RWR2
Input: a SOA G
Output: a SORE r such that L(G) ⊆ L(r) if G is not equivalent to a SORE, and L(G) = L(r) otherwise.
 1: if sink is not reachable from src in G then
 2:     return ∅
 3: else if E(G) = {(src, sink)} then
 4:     return ε
 5: else
 6:     initialize the final automaton Hopt to recognize all words over the alphabet symbols mentioned in G
 7:     return the SORE corresponding to the final automaton computed by RWR2-AUX(G, Hopt)
 8: end if

Algorithm 5. RWR2-AUX
Input: generalized SOAs G and Hopt
Output: final generalized SOA I such that L(G) ⊆ L(I) if G is not equivalent to a SORE, and L(G) = L(I) otherwise.
 1: while a rewrite rule from Figures 4–6 can be applied to G do
 2:     perform the rewrite rule on G
 3: end while
 4: if G is final then
 5:     return G
 6: else
 7:     compute the set R of all possible repairs H of G
 8:     sort R in increasing order by |L(H)^{≤n}|
 9:     for each of the min(ℓ, |R|) best repairs H do
10:         if |L(H)^{≤n}| < |L(Hopt)^{≤n}| then
11:             recursively compute H′ := RWR2-AUX(H, Hopt)
12:             set Hopt := H′ if |L(H′)^{≤n}| < |L(Hopt)^{≤n}|
13:         end if
14:     end for
15:     return Hopt
16: end if

The resulting algorithm, called RWR2 (an abbreviation of REWRITE with best RANKED REPAIRS), is shown in Algorithm 4. Like REWRITE, it first checks whether its input G is equivalent to ∅ or ε. Otherwise, RWR2 uses RWR2-AUX to recursively rewrite and repair G until a final automaton is reached. During this recursion, Hopt is the best final generalized SOA found so far. Initially, on line 6 of RWR2, Hopt is set to the final generalized SOA that accepts all words over alphabet symbols mentioned in G. RWR2-AUX then rewrites G in lines 1–3 until no rewrite rule is applicable anymore. If the resulting G is final, then it is returned. Otherwise, RWR2-AUX computes in line 7 all possible repairs H of G and orders them in line 8 according to increasing values of |L(H)^{≤n}|. The algorithm then recursively calls itself on the ℓ best-ranked repairs in lines 9–14. The test in line 10 is an optimization: if the current repair is already worse than the best final generalized SOA Hopt computed so far in terms of language size, then further rewriting and repairing cannot yield a final generalized SOA that is better than Hopt. Lines 11 and 12 update Hopt when appropriate. Finally, Hopt is returned.
|
|||
|
|
Given its definition, it is clear that RWR2 results in regular expressions with a smaller language size for increasing values of ℓ (the number of best-ranked repairs explored in each call of RWR2-AUX), of course at the cost of increased computation time. In the experiments (Section 7.2) the trade-off between precision and computation time of RWR and RWR2, for increasing values of ℓ, is investigated in more detail.
|
|||
|
|
4.3 Efficiently Computing the Language Size
|
|||
|
|
During its execution, RWR2 repeatedly needs to compute the language size of
|
|||
|
|
the possible repairs. This computation can actually be done quite efficiently
|
|||
|
|
for SOAs, as we show next. Of course, in general RWR2 needs to compute the
|
|||
|
|
language size also for generalized SOAs, not just ordinary SOAs. Our implementation first expands such generalized SOAs into an equivalent SOA using
|
|||
|
|
the Glushkov construction (similar to the ideas of the proof of Proposition 45
|
|||
|
|
in the online appendix that can be accessed in the ACM Digital Library), and
|
|||
|
|
then invokes the language size computation procedure explained next.
|
|||
|
|
|
|||
|
|
|
|||
|
|
Let |L^{=m}| denote the number of words in L of length exactly m. Let G be a SOA, and assume that V(G) − {src, sink} = {a_1, ..., a_n}. Then consider the n × n matrix D where, for i, j ∈ {1, ..., n},

D[i, j] = \begin{cases} 1 & \text{if } (a_i, a_j) \in E, \\ 0 & \text{otherwise.} \end{cases}

In addition, define the 1 × n and n × 1 matrices I and F, respectively, as follows: for i, j ∈ {1, ..., n},

I[1, j] = \begin{cases} 1 & \text{if } (src, a_j) \in E, \\ 0 & \text{otherwise;} \end{cases}

and

F[i, 1] = \begin{cases} 1 & \text{if } (a_i, sink) \in E, \\ 0 & \text{otherwise.} \end{cases}

The following lemma is straightforward to prove by induction on m using the fact that each walk from src to sink in G uniquely determines an accepted word. Let D^m denote the m-fold product of D with itself, with D^0 the unit matrix.

LEMMA 23. Let m > 0 and let G be a SOA. Then |L(G)^{=m}| = I · D^{m−1} · F.
|
|||
|
|
|
|||
|
|
Since for m = 0 we simply have |L(G)^{=m}| = 1 if (src, sink) ∈ E and |L(G)^{=m}| = 0 otherwise, and since |L(G)^{≤n}| = \sum_{m=0}^{n} |L(G)^{=m}|, we can determine |L(G)^{≤n}| by iteratively computing the matrices D^1 to D^{n−1} and applying Lemma 23. This immediately gives the following corollary.

COROLLARY 24. For each n > 0 and SOA G, |L(G)^{≤n}| can be computed in time O(n|G|^3).
|
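The computation behind Lemma 23 and Corollary 24 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: a SOA is encoded as a set of edges with reserved endpoints 'src' and 'sink', and the function name language_size_upto is hypothetical. Instead of materializing the powers D^1, D^2, ..., the sketch carries the row vector I · D^{m−1} from one length to the next, which yields the same counts.

```python
def language_size_upto(edges, n):
    """Number of words of length at most n accepted by the SOA `edges`
    (a set of (u, v) pairs with endpoints 'src' and 'sink')."""
    states = sorted({x for e in edges for x in e} - {"src", "sink"})
    idx = {a: i for i, a in enumerate(states)}
    k = len(states)
    D = [[0] * k for _ in range(k)]   # D[i][j] = 1 iff (a_i, a_j) is an edge
    I = [0] * k                       # I[j] = 1 iff (src, a_j) is an edge
    F = [0] * k                       # F[i] = 1 iff (a_i, sink) is an edge
    for u, v in edges:
        if u == "src" and v == "sink":
            continue
        if u == "src":
            I[idx[v]] = 1
        elif v == "sink":
            F[idx[u]] = 1
        else:
            D[idx[u]][idx[v]] = 1
    total = 1 if ("src", "sink") in edges else 0   # the case m = 0
    row = I[:]                                     # row = I . D^(m-1)
    for _ in range(n):                             # m = 1, ..., n
        total += sum(row[i] * F[i] for i in range(k))
        row = [sum(row[i] * D[i][j] for i in range(k)) for j in range(k)]
    return total


# SOA for (a + b)+: every 2-gram over {a, b} is present.
ab_plus = {("src", "a"), ("src", "b"), ("a", "sink"), ("b", "sink"),
           ("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")}
print(language_size_upto(ab_plus, 3))                      # 2 + 4 + 8 = 14
print(language_size_upto(ab_plus | {("src", "sink")}, 3))  # 15
```

With such a function, ranking a list of candidate repairs amounts to sorting them with language_size_upto as the key, for n equal to twice the number of alphabet symbols.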
|||
|
|
|
|||
|
|
5. CORRECTION
|
|||
|
|
In the conference version of this article [Bex et al. 2006] we proposed a different set of rewrite and repair rules for transforming SOAs into SOREs. While
|
|||
|
|
those rewrite rules were claimed in Bex et al. [2006] to possess the analog of
|
|||
|
|
Proposition 19 (namely that they always produce a SORE equivalent to the
|
|||
|
|
input SOA, provided that such a SORE exists), this claim is false, as we will
|
|||
|
|
detail next. Readers unfamiliar with Bex et al. [2006] may freely skip this
|
|||
|
|
section without endangering comprehension of the rest of the article.
|
|||
|
|
To illustrate why the preceding claim is false, the rewrite rules of Bex et al.
|
|||
|
|
[2006] are given in Figure 10, where G∗ refers to the ε-closure of G, defined as
|
|||
|
|
follows.
|
|||
|
|
Definition 25. Let G = (V, E) be a generalized SOA. The ε-closure G∗ of G
|
|||
|
|
is the graph (V, E∗ ) where E∗ contains:
|
|||
|
|
—all edges of E;
|
|||
|
|
—all edges (r, r) with r = s+ or r = s+ ?;
|
|||
|
|
—all edges (r, s) for which there is a path from r to s in G that passes only
|
|||
|
|
through intermediate nodes t with ε ∈ L(t).
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 10. Set of rewrite rules introduced in the conference version of this article [Bex et al. 2006].
|
|||
|
|
|
|||
|
|
Figure 11 shows a sequence of rewrite steps using these rules starting from
|
|||
|
|
the SOA recognizing (a + b)+ ? or, equivalently, (a? . b?)+ . Note that the second
|
|||
|
|
rewrite step, which introduces b?, causes the automaton to become disconnected: because a? ∈ PredG∗ (b) and sink ∈ SuccG∗ (b) − {b} it deletes (a?, sink)—
|
|||
|
|
the only edge linking src to sink. As such, the accepted language changes from
|
|||
|
|
L((a + b)+ ?) to ∅. This clearly illustrates that the OPTIONAL r? rule in Figure 10
|
|||
|
|
is unsound. For that reason, we have moved in this article to the new rewrite
|
|||
|
|
rules in Figures 4–6.
|
|||
|
|
It is peculiar, however, that we have extensively used the rewrite rules of
Figure 10 together with the repair rules in Figure 13 in a prototype implementation but have never encountered a situation where:
|
|||
|
|
|
|||
|
|
|
|||
|
|
Fig. 11. A problematic sequence of rewrite steps using the rules in Figure 10. The input SOA
|
|||
|
|
accepts the same language as (a + b)+?, or, equivalently, (a? . b?)+. Note that the automaton resulting
from the second rewrite step is disconnected and hence accepts the empty language. Rewriting
|
|||
|
|
is therefore not sound.
|
|||
|
|
|
|||
|
|
Fig. 12. A successful sequence of rewrite steps using the rules in Figure 10. The input SOA accepts
|
|||
|
|
the same language as (a + b)+ ?, or, equivalently (a? . b?)+ .
|
|||
|
|
|
|||
|
|
—we obtained a SORE r that failed to accept at least all words in the input
|
|||
|
|
SOA G; or
|
|||
|
|
—we obtained a SORE r that accepted a strict superset of L(G) when G was
|
|||
|
|
equivalent to a SORE.
|
|||
|
|
We suspect that this behavior is due to the strict order in which we apply the
|
|||
|
|
rewrite rules in our implementation: first CONCATENATION, then DISJUNCTION,
|
|||
|
|
then SELF-LOOP, and finally OPTIONAL. To illustrate, Figure 12 shows a successful
|
|||
|
|
rewriting of the SOA accepting (a + b)+ ? under this order.
|
|||
|
|
The inference algorithm of Bex et al. [2006], which we shall call RWR0 in this
|
|||
|
|
article, is shown in Algorithm 6. It is based on the rewrite rules in Figure 10
|
|||
|
|
and the repair rules in Figure 13. The experiments in Section 7 indicate that RWR0 has no benefits over RWR and RWR2. Moreover, as we do not have a formal soundness and completeness proof showing that rewriting always produces a SORE equivalent to the input SOA (provided that such a SORE exists) under this order, it does not make much sense to consider RWR0 for the class of SOREs. In strong contrast, on the class of k-occurrence regular expressions (k > 1), RWR0 can make a difference over RWR and RWR2 [Bex et al.]. So even without formal guarantees, RWR0 still has its merits.
|
|||
|
|
|
|||
|
|
|
|||
|
|
Algorithm 6. RWR0
Input: a SOA G
Output: a SORE r
 1: if sink is not reachable from src in G then
 2:     return ∅
 3: else if E(G) = {(src, sink)} then
 4:     return ε
 5: else
 6:     initialize done to false
 7:     while not done do
 8:         if a rewrite rule in Figure 10 is applicable then
 9:             rewrite G, giving precedence to CONCATENATION, then DISJUNCTION, then SELF-LOOP, then OPTIONAL
10:         else if a repair rule in Figure 13 is applicable then
11:             repair G, giving precedence to ENABLE-DISJUNCTION, then ENABLE-OPTIONAL-1, then ENABLE-OPTIONAL-2
12:         else
13:             set done to true
14:         end if
15:     end while
16:     if G is final then
17:         return the corresponding regular expression r
18:     else
19:         return ∅
20:     end if
21: end if
|
|||
|
|
|
|||
|
|
6. INFERRING CHARES: CRX
|
|||
|
|
In this section, we present the algorithm CRX for the inference of chain regular
|
|||
|
|
expressions (CHAREs).
|
|||
|
|
Definition 26 (CHAREs ). The class of chain regular expressions consists of
|
|||
|
|
those SOREs of the form f1 · · · fn where every fi is a chain factor—an expression
|
|||
|
|
of the form (a1 + · · · + ak), (a1 + · · · + ak)?, (a1 + · · · + ak)+ , or, (a1 + · · · + ak)+ ? with
|
|||
|
|
k ≥ 1 and every ai is an alphabet symbol.
|
|||
|
|
For instance, the expression a(b+c)+ ?d+ (e + f )? is a CHARE, while (ab+c)+ ?
|
|||
|
|
and (a+ ? + b?)+ ? are not.
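As an illustration of Definition 26, the sketch below translates a CHARE, given as a list of chain factors, into a pattern for Python's standard re module and rejects factor lists that are not alphabet-disjoint. The encoding of a factor as a (symbols, modifier) pair and the name chare_to_pattern are assumptions of this example. Note that the article's postfix "+?" means "one or more, made optional", i.e. zero or more, whereas "+?" in Python's re syntax is a lazy quantifier; the translation therefore emits "*".

```python
import re

# Article modifier -> Python quantifier.
QUANT = {"": "", "?": "?", "+": "+", "+?": "*"}


def chare_to_pattern(factors):
    """factors: list of (symbols, modifier) pairs, where `symbols` is a
    string of single-character alphabet symbols and `modifier` is one of
    '', '?', '+', '+?'."""
    seen = set()
    parts = []
    for symbols, modifier in factors:
        distinct = set(symbols)
        if not symbols or len(distinct) != len(symbols) or seen & distinct:
            raise ValueError("chain factors must be non-empty and alphabet-disjoint")
        seen |= distinct
        alternatives = "|".join(re.escape(a) for a in symbols)
        parts.append("(?:" + alternatives + ")" + QUANT[modifier])
    return re.compile("".join(parts))


# a(b+c)+?d+(e+f)? from the text.
chare = chare_to_pattern([("a", ""), ("bc", "+?"), ("d", "+"), ("ef", "?")])
print(chare.pattern)                      # (?:a)(?:b|c)*(?:d)+(?:e|f)?
print(bool(chare.fullmatch("abcbdd")))    # True
print(bool(chare.fullmatch("adef")))      # False: at most one of e, f
```

The disjointness check reflects that a CHARE is a SORE, so no alphabet symbol may occur in two chain factors.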
|
|||
|
|
Since each CHARE is a concatenation of alphabet-disjoint chain factors,
|
|||
|
|
every occurrence of an alphabet symbol in a word must be generated by the
|
|||
|
|
same chain factor in the target CHARE. The positional relationships between
|
|||
|
|
occurrences of alphabet symbols in a given sample then allow us to deduce
|
|||
|
|
which chain factors are present in the target CHARE, and how they are ordered.
|
|||
|
|
Example 27. Consider the sample S = {u, v, w} with u = abd, v = bcdee,
|
|||
|
|
and w = cade. Clearly a occurs before b in u, b occurs before c in v, and c occurs
|
|||
|
|
before a in w. In the target CHARE, therefore, a, b, and c must belong to the
|
|||
|
|
same chain factor which can only be (a + b + c)+ or (a + b + c)+ ?. Since one of
|
|||
|
|
{a, b, c} is present in every word of S, we choose (a + b + c)+ . Similarly, d and
|
|||
|
|
e form chain factors by themselves. Whereas d occurs once in every word in S,
|
|||
|
|
e can occur zero, one, or more times. Therefore, d is represented by the chain factor d, while e is represented by the chain factor e+?. Since a, b, c always occur before d, which in turn always occurs before the e's, the derived CHARE is then (a + b + c)+ d e+?.

Fig. 13. Repair rules accompanying the rewrite rules in Figure 10. These rules are a correction of the rules presented in Bex et al. [2006]. Repairs are tried in the order shown. In particular, ENABLE-OPTIONAL-2 is only applied if none of the other rules is applicable.
|
|||
|
|
So, in brief, CRX computes chain factors, orders them, and uses that order to
|
|||
|
|
generate a CHARE. Of course, the order of the chain factors is not necessarily
|
|||
|
|
linear. In that case, a linear order can be constructed by making the factors
|
|||
|
|
optional. Some care has to be taken, however, to generate factors that are
|
|||
|
|
disjunctions without repetitions.
|
|||
|
|
Definition 28. Let S be a sample. We denote by →_S the partial preorder on Σ such that a →_S b if, and only if, a immediately precedes b in some w ∈ S. (I.e., ab is a 2-gram of S.) We say that a occurs before b in S if a →∗_S b, where →∗_S is the reflexive and transitive closure of →_S.

For instance, Figure 14 illustrates →_S when S = {abccde, cccad, bfegg, bfehi}.
|
|||
|
|
Definition 29. Define a ≈ S b if a occurs before b in S and b occurs before a.
|
|||
|
|
That is, a ≈ S b if a →∗S b and b →∗S a.
|
|||
|
|
Clearly, ≈_S is an equivalence relation. We consider the set of equivalence classes of ≈_S. In what follows, we denote such equivalence classes by, for example, [a1, ..., an]. As usual, an equivalence class of cardinality 1 is called a singleton.

Fig. 14. The partial preorder →_S for S = {abccde, cccad, bfegg, bfehi}.

Fig. 15. The Hasse diagram H_S of the sample S = {abccde, cccad, bfegg, bfehi}. The corresponding partial preorder from which H_S is derived is shown in Figure 14.
|
|||
|
|
Definition 30. The Hasse diagram of S, denoted H_S, is the graph over the equivalence classes of ≈_S in which there is an edge from equivalence class [a1, ..., an] to class [b1, ..., bm] if: (1) [a1, ..., an] and [b1, ..., bm] are distinct and (2) there exist 1 ≤ i ≤ n and 1 ≤ j ≤ m such that ai →_S bj.

For instance, the Hasse diagram of the sample S = {abccde, cccad, bfegg, bfehi} is shown in Figure 15. The operation of CRX is then shown in Algorithm 7 and illustrated in the following example.
|
|||
|
|
Example 31. Consider again the sample S = {abccde, cccad, bfegg, bfehi} and its corresponding Hasse diagram in Figure 15. Since Pred_{H_S}([d]) = Pred_{H_S}([f]) and Succ_{H_S}([d]) = Succ_{H_S}([f]), line 3 applies to [d] and [f]. Although Pred_{H_S}([g]) = Pred_{H_S}([h]), line 3 cannot be applied to [g] and [h] as Succ_{H_S}([g]) ≠ Succ_{H_S}([h]). Similarly, [g] and [i] share successors, that is, ∅, but have different predecessors. Hence, after the while loop in line 2 we obtain the diagram in which [d] and [f] are merged into the single node [d, f] and all other nodes are unchanged.

A possible topological sort is [a, b, c], [d, f], [e], [g], [h], [i]. Since at least one of a, b, and c occurs once or more in every string of S, r([a, b, c]) = (a + b + c)+ is the first factor; the second factor is (d + f) since either d or f occurs exactly once in every string; the factor derived from [e] is e? since S contains a string without e, and similarly for those from [h] and [i]. Finally, g occurs multiple times in a single string and is absent from the others. Hence the CHARE derived by the algorithm is (a + b + c)+ · (d + f) · e? · g+? · h? · i?, which completes the loop starting in line 6.
|
|||
|
|
Note that the order of the chain factors in the CHARE depends on the
|
|||
|
|
topological sort.
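The steps of Algorithm 7 can be sketched compactly in Python. This is a reading of the algorithm as described in the text, not the authors' code: words are strings over single-character symbols, the function name crx and the string rendering of the result are choices of this example, and ties in the topological sort are broken alphabetically so that the output is deterministic.

```python
def crx(sample):
    """Return a CHARE (as a string) whose language contains every word
    in `sample`."""
    symbols = sorted({a for w in sample for a in w})
    # a ->_S b iff ab is a 2-gram of the sample.
    succ = {a: set() for a in symbols}
    for w in sample:
        for a, b in zip(w, w[1:]):
            succ[a].add(b)
    # Reflexive-transitive closure of ->_S by fixpoint iteration.
    reach = {a: {a} | succ[a] for a in symbols}
    changed = True
    while changed:
        changed = False
        for a in symbols:
            new = set().union(*(reach[b] for b in reach[a]))
            if not new <= reach[a]:
                reach[a] |= new
                changed = True
    # Equivalence classes (mutual reachability) and Hasse diagram edges.
    cls = {a: frozenset(b for b in reach[a] if a in reach[b]) for a in symbols}
    nodes = set(cls.values())
    edges = {(cls[a], cls[b]) for a in symbols for b in succ[a]
             if cls[a] != cls[b]}
    # Lines 2-4: merge singletons with equal predecessor and successor sets.
    merged = True
    while merged:
        merged = False
        groups = {}
        for g in nodes:
            if len(g) == 1:
                key = (frozenset(u for u, v in edges if v == g),
                       frozenset(v for u, v in edges if u == g))
                groups.setdefault(key, []).append(g)
        for group in groups.values():
            if len(group) > 1:
                new = frozenset().union(*group)
                nodes = (nodes - set(group)) | {new}
                edges = {(new if u in group else u, new if v in group else v)
                         for u, v in edges}
                merged = True
                break
    # Line 5: topological sort of the (acyclic) diagram.
    order, remaining = [], set(nodes)
    while remaining:
        ready = [g for g in remaining
                 if not any(v == g and u in remaining for u, v in edges)]
        nxt = min(ready, key=sorted)
        order.append(nxt)
        remaining.remove(nxt)
    # Lines 6-15: choose the chain factor for each node.
    factors = []
    for g in order:
        counts = [sum(w.count(a) for a in g) for w in sample]
        if all(c == 1 for c in counts):
            mod = ""
        elif all(c <= 1 for c in counts):
            mod = "?"
        elif all(c >= 1 for c in counts):
            mod = "+"
        else:
            mod = "+?"
        body = " + ".join(sorted(g))
        factors.append((f"({body})" if len(g) > 1 else body) + mod)
    return " . ".join(factors)


print(crx(["abd", "bcdee", "cade"]))
# (a + b + c)+ . d . e+?
print(crx(["abccde", "cccad", "bfegg", "bfehi"]))
# (a + b + c)+ . (d + f) . e? . g+? . h? . i?
```

The two calls reproduce the expressions derived by hand in Examples 27 and 31.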
|
|||
|
|
THEOREM 32. Given a sample S, CRX computes a CHARE r such that S ⊆ L(r).
|
|||
|
|
|
|||
|
|
|
|||
|
|
|
|||
|
|
Algorithm 7. CRX
Input: a sample S
Output: a CHARE r such that S ⊆ L(r)
 1: Compute the set of equivalence classes of ≈_S
 2: while a maximal set of singleton nodes γ1, ..., γℓ (ℓ ≥ 2) such that Pred_{H_S}(γ1) = · · · = Pred_{H_S}(γℓ) and Succ_{H_S}(γ1) = · · · = Succ_{H_S}(γℓ) exists do
 3:     Replace γ1, ..., γℓ by γ := γ1 ∪ · · · ∪ γℓ, and redirect all incoming and outgoing edges of the γi to γ in H_S
 4: end while
 5: Compute a topological sort γ1, ..., γk of the nodes
 6: for all i ∈ {1, ..., k} (γi = [a1, ..., an]) do
 7:     if every w ∈ S contains exactly one occurrence of a symbol in {a1, ..., an} then
 8:         r(γi) := (a1 + · · · + an)
 9:     else if every w ∈ S contains at most one occurrence of a symbol in {a1, ..., an} then
10:         r(γi) := (a1 + · · · + an)?
11:     else if every w ∈ S contains at least one of a1, ..., an and there is a word that contains at least two occurrences of symbols then
12:         r(γi) := (a1 + · · · + an)+
13:     else
14:         r(γi) := (a1 + · · · + an)+?
15:     end if
16: end for
17: return r(γ1) . r(γ2) . · · · . r(γk)
|
|||
|
|
|
|||
|
|
PROOF. The theorem follows almost immediately from the construction.
|
|||
|
|
Clearly, CRX always outputs a CHARE. Moreover, observe that after step 5
|
|||
|
|
the computed topological sort is consistent with the order of the symbols in the
|
|||
|
|
words in S. More precisely, there cannot exist symbols a and b such that a ∈ γi,
|
|||
|
|
b ∈ γ j , i < j, and b →∗S a. Subsequently, for each γi a chain factor is chosen
|
|||
|
|
in such a manner that it is consistent with all words w ∈ S. As these factors
|
|||
|
|
are ordered consistently with the order of the symbols in S, this implies that
|
|||
|
|
S ⊆ L(r).
|
|||
|
|
Furthermore, on the class of CHAREs, CRX is complete.
|
|||
|
|
THEOREM 33. For each CHARE r there is a sample S such that L(r) = L(CRX(S)).
|
|||
|
|
|
|||
|
|
PROOF. Denote by Sym(r) the set of alphabet symbols occurring in r. We also
|
|||
|
|
abuse notation and, for a sample S, write Sym(S) to denote the set of alphabet
|
|||
|
|
symbols occurring in S. Let r = f1 · · · fk be a CHARE, with each fi a chain
|
|||
|
|
factor. We construct the sample S such that CRX(S) is syntactically equal to
|
|||
|
|
r, up to commutativity of +. The theorem then follows.
|
|||
|
|
Thereto, for every 1 ≤ i ≤ k, let wi be a word in L( fi ). We construct S by
|
|||
|
|
subsequently adding words to it. First, for all 1 ≤ i ≤ k − 1, a ∈ Sym( fi ),
|
|||
|
|
b ∈ Sym( fi+1 ), we add w1 · · · wi−1 abwi+2 · · · wk to S. Further, for all 1 ≤ i ≤ k,
|
|||
|
|
we add words to S, depending on the form of fi . Specifically, if fi is of the
|
|||
|
|
form:
|
|||
|
|
—(a1 + · · · + an), we add w1 · · · wi−1 a1 wi+1 · · · wk;
|
|||
|
|
—(a1 + · · · + an)?, we add w1 · · · wi−1 wi+1 · · · wk, and w1 · · · wi−1 a1 wi+1 · · · wk;
|
|||
|
|
|
|||
|
|
|
|||
|
|
—(a1 + · · · + an)+ , we add w1 · · · wi−1 a1 a1 wi+1 · · · wk;
|
|||
|
|
—(a1 + · · · + an)+ ?, we add w1 · · · wi−1 wi+1 · · · wk, and w1 · · · wi−1 a1 a1 wi+1 · · · wk.
|
|||
|
|
We now argue that given S, CRX indeed derives an expression syntactically
|
|||
|
|
equal to r. First observe that already before step 3, CRX computes k nodes γ1 to
|
|||
|
|
γk, which are linearly ordered, such that for each 1 ≤ i ≤ k, γi contains exactly
|
|||
|
|
the alphabet symbols contained in fi . Then, due to the number of occurrences
|
|||
|
|
of each symbol of the different chain factors, the algorithm will associate to
|
|||
|
|
each γi exactly the factor fi , and hence CRX(S) is syntactically equivalent to r,
|
|||
|
|
up to commutativity of +.
|
|||
|
|
From Theorems 32 and 33 it readily follows that we have the next corollary.
|
|||
|
|
COROLLARY 34.
|
|||
|
|
|
|||
|
|
CRX learns the class of CHAREs from positive data.
|
|||
|
|
|
|||
|
|
The experiments in Section 7.3 show that the number of words in S needed
|
|||
|
|
in practice is very small. Actually, the prime feature that makes CRX much
|
|||
|
|
more robust than RWR for very small datasets is its strong generalization ability. Indeed, consider an expression of the form (a1 + · · · + an)+ ?. While REWRITE
|
|||
|
|
requires all n^2 2-grams of the form ai aj for i, j ∈ {1, ..., n} to be present, RWR requires around (n^2 − n) 2-grams. For CRX, however, the set {ε, a1a2, a2a3, ..., an−1an, ana1} of size O(n) will suffice. This point is illustrated in practice by example3 and example4 in Table II, where n has a value of 41 and 56, respectively. Experiments illustrate that only 400 (versus 1682) and 500 (versus 3136)
|
|||
|
|
2-grams are needed by CRX to learn example3 and example4, respectively.
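The O(n) sample mentioned above is easy to construct and check. The following Python sketch builds it and verifies that every symbol occurs before every other symbol, so that all symbols end up in a single equivalence class; since ε contains no symbol and each other word contains two, Algorithm 7 then selects the form (a1 + · · · + an)+?. The function names cycle_sample and mutually_reachable, and the encoding of words as tuples of symbol names, are assumptions of this illustration.

```python
def cycle_sample(symbols):
    """The sample {ε, a1a2, a2a3, ..., a(n-1)an, an a1}; words are tuples."""
    n = len(symbols)
    return [()] + [(symbols[i], symbols[(i + 1) % n]) for i in range(n)]


def mutually_reachable(sample):
    """True if every symbol occurs before every other symbol in the
    sample, i.e. all symbols form one equivalence class."""
    succ = {}
    for w in sample:
        for a in w:
            succ.setdefault(a, set())
        for a, b in zip(w, w[1:]):
            succ[a].add(b)

    def reach(a):
        seen, stack = {a}, [a]
        while stack:
            for b in succ[stack.pop()]:
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    return all(reach(a) == set(succ) for a in succ)


n = 5
sample = cycle_sample([f"a{i}" for i in range(1, n + 1)])
print(len(sample), "words instead of", n * n, "2-grams")  # 6 words instead of 25 2-grams
print(mutually_reachable(sample))                         # True
```

The n 2-grams of the cycle suffice because the equivalence classes are defined through the transitive closure of the 2-gram relation rather than through the 2-grams themselves.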
|
|||
|
|
The following theorem shows that CRX is optimal within the class of CHAREs
|
|||
|
|
when the partial order S is in fact a linear order.
|
|||
|
|
THEOREM 35. For every sample S, if S is a linear order then for every
|
|||
|
|
CHARE r such that S ⊆ L(r) and L(r) ⊆ L(CRX(S)), we have r = CRX(S), that is, r
|
|||
|
|
is syntactically equal to CRX(S) up to commutativity of +.
|
|||
|
|
PROOF. Assume that CRX(S) = f1 · · · fk and r = g1 · · · gl . Clearly,
|
|||
|
|
Sym(CRX(S)) = Sym(r) = Sym(S). We first argue that k = l. Thereto, assume
|
|||
|
|
for the purpose of contradiction that k < l. Then, there is a chain factor f in
|
|||
|
|
CRX(S) with a, b ∈ Sym(f) and two chain factors g and g′ in r with a ∈ Sym(g)
and b ∈ Sym(g′). We distinguish two cases.
(1) If f is of the form (a1 + · · · + an) or (a1 + · · · + an)?, then L(r) ⊈ L(CRX(S)).
(2) If f is of the form (a1 + · · · + an)+? or (a1 + · · · + an)+, by construction and
since S is linearly ordered, there are words u1, u2 ∈ S such that $a \rightarrow^{*}_{u_1} b$
and $b \rightarrow^{*}_{u_2} a$. However, since a and b are in different chain factors of r,
either u1 ∉ L(r) or u2 ∉ L(r), and hence S ⊈ L(r).
|
|||
|
|
Conversely, assume k > l. Then, there are chain factors f, f′ in CRX(S) with
a ∈ Sym(f) and b ∈ Sym(f′), and a chain factor g in r with a, b ∈ Sym(g). We
again distinguish two cases.
(1) If g is of the form (a1 + · · · + an)+? or (a1 + · · · + an)+, then L(r) ⊈ L(CRX(S)).
|
|||
|
|
(2) If g is of the form (a1 +· · ·+an) or (a1 +· · ·+an)?, by construction and since S
|
|||
|
|
ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.
|
|||
|
|
is linearly ordered, there are words u1, . . . , um ∈ S, and symbols c1, . . . , cm−1
such that $a \rightarrow^{*}_{u_1} c_1$, $c_m \rightarrow^{*}_{u_m} b$, and $c_i \rightarrow_{u_{i+1}} c_{i+1}$, for all 1 ≤ i ≤ m − 1.
However, due to the form of g, ui ∉ L(r) must hold for at least one of these ui,
and hence S ⊈ L(r).
|
|||
|
|
Using the same kind of argument it can be shown that Sym( fi ) = Sym(gi ),
|
|||
|
|
for all 1 ≤ i ≤ k. Further, since L(r) ⊆ L(CRX(S)), for every 1 ≤ i ≤ k, we
|
|||
|
|
have L(gi) ⊆ L(fi). Since the different chain factors can only take a restricted
number of forms, it now suffices to show that L(gi) = L(fi), for all i, to show that
they are also syntactically equivalent. Hence, towards a contradiction, assume
L(gi) ⊊ L(fi) for some 1 ≤ i ≤ k. This can only be the case if: (1) gi = (a1 + · · · + an)
and fi = (a1 + · · · + an); (2) gi = (a1 + · · · + an)+? and fi = (a1 + · · · + an)+; or
(3) gi = (a1 + · · · + an)? and fi is one of the three other forms. However, in each of
these cases, given the construction of the algorithm, one can find a word w ∈ S
such that w ∉ L(r). Hence, for all i, L(fi) = L(gi), and thus r = CRX(S).
|
|||
|
|
Note that this property does not hold when S is not linear. For instance, on
|
|||
|
|
S = {abc, ade, abe} CRX yields a·b?·d?·c?·e? whereas the CHARE a·(b+d)·(c +e)
|
|||
|
|
is a better approximation of the target language.
|
|||
|
|
CRX can be efficiently executed on very large datasets by only maintaining
H_S and the multiplicities of occurrences of Σ-symbols in words in S (needed for
lines 6–13). From this representation, lines 2–5 can be executed. Hence, it is
not necessary that the entire sample resides in main memory. The complexity
of the algorithm is O(m + n³), where m is the size of the sample and n the
number of alphabet symbols.
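The streaming idea in the preceding paragraph can be sketched as a summary object that consumes one word at a time and then discards it. This is a minimal illustration under stated assumptions: the class name SampleSummary is hypothetical, and it is assumed that the order-related structure can be derived afterwards from the stored successor pairs, which the sketch does not attempt.

```python
from collections import Counter


class SampleSummary:
    """Summary of a sample: successor pairs and per-symbol occurrence ranges.

    Memory is O(n^2) in the number of symbols and independent of sample size.
    """

    def __init__(self):
        self.pairs = set()    # (a, b) such that b directly follows a in some word
        self.min_occ = {}     # fewest occurrences of a symbol in any one word
        self.max_occ = {}     # most occurrences of a symbol in any one word
        self.words_seen = 0

    def add(self, word):
        counts = Counter(word)
        for sym in counts:
            if sym not in self.min_occ:
                # A symbol first seen now occurred 0 times in all earlier words.
                self.min_occ[sym] = 0 if self.words_seen else counts[sym]
                self.max_occ[sym] = 0
        for sym in self.min_occ:
            c = counts.get(sym, 0)
            self.min_occ[sym] = min(self.min_occ[sym], c)
            self.max_occ[sym] = max(self.max_occ[sym], c)
        self.pairs.update(zip(word, word[1:]))
        self.words_seen += 1


if __name__ == "__main__":
    summary = SampleSummary()
    for w in ["abc", "ade", "abe"]:  # the sample used in the text above
        summary.add(w)
    print(sorted(summary.pairs), summary.min_occ, summary.max_occ)
```

A symbol with occurrence range [0, 1] would receive the qualifier ?, and one with range [1, 1] no qualifier, which matches the output a·b?·d?·c?·e? reported for this sample.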
|
|||
|
|
7. EXPERIMENTAL EVALUATION
|
|||
|
|
In this section we validate our approach by means of experimental analysis.
|
|||
|
|
Specifically, we assess the quality of the expressions returned by our algorithms
|
|||
|
|
on real-world corpora and DTDs, and compare it with the quality of expressions
|
|||
|
|
returned by XTRACT [Garofalakis et al. 2003] and Trang [Clark]. Next, we compare the quality of RWR0 (the algorithm found in the conference version of this
|
|||
|
|
article), RWR, and RWR2 . Subsequently, we investigate the performance of the algorithms on incomplete and noisy data. Finally, we discuss their running time
|
|||
|
|
performance. We abuse notation and simply write RWR for the application of
|
|||
|
|
2T-INF followed by RWR, similarly for RWR0 and RWR2 . All experiments were performed using a prototype implementation of our algorithms in Java executed
|
|||
|
|
on a 2.5 GHz Pentium 4 machine with 1GB of RAM.
|
|||
|
|
7.1 Real-World Examples
|
|||
|
|
The number of publicly available XML corpora is rather limited. We employed
|
|||
|
|
the XML Data repository maintained by Miklau [2002] as a testbed. Unfortunately, most of the corpora listed there are either very small, lack a DTD,
|
|||
|
|
or contain a DTD with only trivial regular expressions. Nevertheless, two of
|
|||
|
|
the listed corpora are interesting. Specifically, we compared XTRACT, RWR, and
|
|||
|
|
CRX on the Protein Sequence Database (683Mb in size) and the Mondial corpus
|
|||
|
|
[Miklau 2002], a database of information on various countries (1Mb in size).
|
|||
|
|
Table I. Results of RWR, CRX and XTRACT on DTDs and Sample Data from the Protein Description Database and the Mondial Corpora

Element    Source        Sample size  Expression
ProteinE.  Original DTD  -            a1 a2 a3 a4+? a5+? a6+? a7+? a8+? a9? a10? a11+? a12 a13
           CRX/RWR       2458         a1 a2 a3 a4+ a5+? a6+? a7+? a8+? a9? a10? a11+? a12 a13
           XTRACT        843          an expression of 185 tokens
organism   Original DTD  -            a1 a2? a3 a4? a5+?
           CRX/RWR       9            a1 a2? a3 a4? a5+?
           XTRACT        9            a1 ((a2 a3 a4? + a3 a4) a5? + a3 a5+?)
reference  Original DTD  -            a1 a2+? a3+? a4+?
           CRX/RWR       45           a1 a2+? a3+? a4+?
           XTRACT        45           a1 (a2+? (a4+? + a3+?) + a2 a3+? a4 a4 + a3+? a4+?)
refinfo    Original DTD  -            a1 a2 a3? a4? a5 a6? (a7 + a8)? a9?
           CRX/RWR       10           a1 a2 (a3 + a4)? a5 a6? a7? a9? a8?
           XTRACT        10           a1 a2 ((a3 a5 a6 a7? + a4 a5) a9? + a5 (a7 + a8)? + a4 a5 a8)
authors    Original DTD  -            a1+ + (a2 a3?)
           CRX/RWR       54           a1+? a2? a3? (CRX) / a1+ + (a2 a3) (RWR)
           XTRACT        54           a1+? + a2 a3
accinfo    Original DTD  -            a1 a2+? a3+? a4? a5? a6? a7+?
           CRX/RWR       124          a1 a2+? a3+ a4? a5? a6? a7+?
           XTRACT        124          an expression of 97 tokens
genetics   Original DTD  -            a1+? a2? a3? a4? a5? a6? a7? a8? a9? a10? a11+? a12+?
           CRX/RWR       219          a1+? a2? a3? a4? a5? a6? a7? a8? a9? a10? a12+?
           XTRACT        219          an expression of 329 tokens
function   Original DTD  -            a1? a2+? a3+?
           CRX/RWR       26           a1? a2+? a3+?
           XTRACT        26           (a1 (a2? a2? a3+? + a2+? (a3 a3)+? + a2 a2 a2 a3) + a2 (a2 a3+? + a3+?))
city       Original DTD  -            a1 a2+? a3+?
           CRX/RWR       9            a1 a2+? a3+?
           XTRACT        9            a1 (a2+? a3 a3? + a2 (a3+? + a2))?

For each element, the rows give the original DTD, the DTD inferred by CRX/RWR, and
the result of XTRACT, in that order, together with the sample size used by CRX/RWR
and by XTRACT, respectively.
|
|||
|
|
|
|||
|
|
Since no real-world data could be obtained for SOREs that are not CHAREs,
|
|||
|
|
we generated our own XML data for a number of real-world DTDs considered
|
|||
|
|
in Bex et al. [2004] containing a number of sophisticated regular expressions
|
|||
|
|
outside the class of CHAREs.
|
|||
|
|
Real-world data. In this section, we only discuss RWR as RWR0 and RWR2 give
|
|||
|
|
precisely the same results. Table I lists all nontrivial element definitions (see footnote 2) in
|
|||
|
|
the aforementioned DTDs together with the results derived by the inference
|
|||
|
|
algorithms RWR, CRX, and XTRACT. It is interesting to note that only the regular
|
|||
|
|
expression for authors is not a CHARE. Moreover, no elements are repeated
|
|||
|
|
in any of the definitions. This should not come as a surprise given the observations discussed in the Introduction on the content models occurring in practice.
|
|||
|
|
The regular expression derived by the XTRACT algorithm is shown whenever
|
|||
|
|
it fitted the table, otherwise the number of tokens it consists of is listed. For
|
|||
|
|
better readability the actual output of XTRACT has been simplified by replacing
|
|||
|
|
expressions such as (ai + ε) by ai ?.
|
|||
|
|
2 It should be noted that the examples from the Mondial corpus are not valid according to their
|
|||
|
|
DTD, so for the city element only valid elements were used as training examples.
|
|||
|
|
It can be verified that all regular expressions in Table I are learned quite
satisfactorily by RWR and CRX with respect to the examples extracted from the
|
|||
|
|
XML corpus. The numbers in the first column refer to the size of the sample.
|
|||
|
|
RWR and CRX always produce the same result except for authors where CRX
|
|||
|
|
cannot derive the target expression as it is not a CHARE. We note that no
|
|||
|
|
sample was representative of its target expression. As such, RWR always had to
|
|||
|
|
apply repair rules. The expressions in the table indicate that the results of these
repairs are satisfactory. For a few expressions, for instance, ProteinE(ntry),
|
|||
|
|
refinfo, and genetics, the expressions produced by CRX and RWR are more
|
|||
|
|
strict than the corresponding one in the DTD. This is due to the data present
|
|||
|
|
in the sample. For instance, for genetics, no a11 element occurs in the sample
|
|||
|
|
so it obviously cannot be part of the derived expression. The element refinfo
|
|||
|
|
illustrates that a3 and a4 are mutually exclusive in the sample and that a8 is
|
|||
|
|
never followed by a9 . Inspecting the original DTD illustrates the underlying
|
|||
|
|
semantics.
|
|||
|
|
authors, citation, volume?, month?, year,
|
|||
|
|
pages?, (title | description)?, xrefs?
|
|||
|
|
Indeed, volume is used in the context of a journal, while month is used for a
|
|||
|
|
conference publication. Apart from the authors element XTRACT either produces
|
|||
|
|
a suboptimal expression or no expression at all. For instance, XTRACT crashes on
|
|||
|
|
the ProteinE(ntry) sample due to excessive memory consumption (more than
|
|||
|
|
1GB of RAM). Reducing the size of the sample to approximately 800 unique
|
|||
|
|
words yields a complex expression of 185 tokens.
|
|||
|
|
Real-world regular expressions. Table II lists the results of the algorithms on
|
|||
|
|
a number of more sophisticated regular expressions extracted from real-world
|
|||
|
|
DTDs discussed in Bex et al. [2004]. Since no real-world data was available
|
|||
|
|
for those DTDs, we have randomly generated samples using ToXgene [Barbosa
|
|||
|
|
et al. 2002], taking care that all relevant examples were present to ensure
|
|||
|
|
the target expression could be learned. Again, we list the sample size in the
|
|||
|
|
first column. As some of these numbers might seem artificially large, we note
|
|||
|
|
that, for instance, the SOA corresponding to example3 already contains 1897
|
|||
|
|
edges. Hence, a random dataset of 5741 words is not unreasonably large. Note
|
|||
|
|
that only the first three expressions in Table II are SOREs; none of them
is a CHARE. The table shows clearly that CRX yields fairly good and concise
|
|||
|
|
super-approximations to the original expressions. In some cases, the results
|
|||
|
|
produced by RWR are more precise. For XTRACT, the size of the sample had to be
|
|||
|
|
limited to 300–500 in order to avoid a crash. As can be seen from the table,
|
|||
|
|
XTRACT performed excellently on the first example, but failed to generate an
|
|||
|
|
expression that fitted the table in all other cases on all the sample sets we
|
|||
|
|
tried.
|
|||
|
|
Trang. We ran Trang [Clark] on the XML data discussed in this section.
|
|||
|
|
In all but one case, Trang produced exactly the same output as CRX. The
exception is example1, where Trang’s output depends on the order in which
the examples are presented, yielding either a1+? a2? a3+? or a1+ + (a2? a3+). The
|
|||
|
|
Table II. Results of RWR, CRX and XTRACT on Nonsimple Real-World DTDs and Generated Data

Element   Source        Sample size  Expression
example1  Original DTD  -            a1+ + (a2? a3+)
          CRX           48           a1+? a2? a3+?
          RWR           48           a1+ + (a2? a3+)
          XTRACT        48           a1+? + (a2? a3+?)
example2  Original DTD  -            (a1 a2? a3?)? a4? (a5 + · · · + a18)+?
          CRX           2210         a1? a2? a3? a4? (a5 + · · · + a18)+?
          RWR           2210         (a1 a2? a3?)? a4? (a5 + · · · + a18)+?
          XTRACT        300          an expression of 252 tokens
example3  Original DTD  -            a1? (a2 a3?)? (a4 + · · · + a44)+? a45+
          CRX           5741         a1? a2? a3? (a4 + · · · + a44)+? a45+
          RWR           5741         a1? (a2 a3?)? (a4 + · · · + a44)+? a45+
          XTRACT        400          an expression of 142 tokens
example4  Original DTD  -            a1? a2 a3? a4? (a5+ + ((a6 + · · · + a61)+ a5+?))
          CRX           10000        a1? a2 a3? a4? (a6 + · · · + a61)+? a5+?
          RWR           10000        a1? a2 a3? a4? (a6 + · · · + a61)+? a5+?
          XTRACT        500          an expression of 185 tokens
example5  Original DTD  -            a1 (a2 + a3)+? (a4 (a2 + a3 + a5)+?)+?
          CRX           1281         a1 (a2 + a3 + a4 + a5)+?
          RWR           1281         a1 ((a2 + a3 + a4)+ a5+?)+?
          XTRACT        500          an expression of 85 tokens

For each element, the rows give the original DTD, the DTD inferred by CRX, the DTD
inferred by RWR, and the result of XTRACT, in that order, together with the sample
size used by CRX, RWR and XTRACT, respectively.
|
|||
|
|
|
|||
|
|
former is the same output as CRX, the latter is the intended RE that cannot
|
|||
|
|
be derived by CRX as it is outside the class of CHAREs. This inconsistency in
|
|||
|
|
Trang’s output casts some doubt on its correctness and underscores the need
|
|||
|
|
for a formal model as the cornerstone of an implementation. Indeed, there is no
|
|||
|
|
article or manual available describing the machinery underlying Trang. A look
|
|||
|
|
at the Java-code indicates that Trang is related to, but different from, CRX: it
|
|||
|
|
uses 2T-INF to construct an automaton, eliminates cycles by merging all nodes
|
|||
|
|
in the same strongly connected component, and then transforms the obtained
|
|||
|
|
DAG into a regular expression. However, no target class of REs for which Trang
|
|||
|
|
is complete, as is the case for CRX, is specified. As Trang is similar to CRX, it is
|
|||
|
|
outperformed by RWR and RWR2 .
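The first two steps of the pipeline described above (build a successor graph from the sample, then merge each strongly connected component into one node) can be sketched as follows. This only illustrates the general technique under simplifying assumptions: it is not Trang's code, it omits source and sink states, the function names are hypothetical, and it finds components by pairwise reachability for brevity rather than with a linear-time algorithm.

```python
def successor_graph(words):
    """Map each symbol to the set of symbols that directly follow it."""
    graph = {}
    for w in words:
        for s in w:
            graph.setdefault(s, set())
        for a, b in zip(w, w[1:]):
            graph[a].add(b)
    return graph


def reach(graph, start):
    """Symbols reachable from `start` in one or more steps."""
    seen, stack = set(), [start]
    while stack:
        for y in graph[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def condense(graph):
    """Merge strongly connected components; return the resulting DAG."""
    r = {v: reach(graph, v) for v in graph}
    comp = {
        v: frozenset(u for u in graph if u == v or (u in r[v] and v in r[u]))
        for v in graph
    }
    dag = {c: set() for c in comp.values()}
    for a, succs in graph.items():
        for b in succs:
            if comp[a] != comp[b]:
                dag[comp[a]].add(comp[b])
    return dag


if __name__ == "__main__":
    # b and c lie on a cycle, so they collapse into a single node.
    for node, succs in condense(successor_graph(["abcbd"])).items():
        print(sorted(node), "->", [sorted(s) for s in succs])
```

Each merged component would correspond to a repeated disjunction of its symbols, which is why the result resembles a CHARE whenever the remaining DAG is a chain.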
|
|||
|
|
7.2 RWR versus RWR2
|
|||
|
|
We tested the results and performance of RWR versus RWR2 for various values
of the rank cut-off parameter. The SOAs used in this test were randomly
|
|||
|
|
generated with 5 and 10 alphabet symbols. The results are summarized in
|
|||
|
|
Table III(a). We computed the average language size of the SOAs, which is the
|
|||
|
|
target size. It should be noted that since no SORE corresponds to these SOAs,
|
|||
|
|
the target size can never be attained since the regular expression resulting
|
|||
|
|
from RWR or RWR2 will necessarily be a generalization of the SOA’s language.
|
|||
|
|
It is immediately clear from Table III(a) that results of RWR2 are on average
better than those for RWR, and that they improve with increasing values of the cut-off parameter.
|
|||
|
|
Table III.

(a)
Algorithm         |Σ| = 5   |Σ| = 10
target size       0.52      0.67
RWR0              0.88      0.98
RWR               0.80      0.96
RWR2, cut-off 1   0.76      0.95
RWR2, cut-off 2   0.73      0.92
RWR2, cut-off 3   0.725     0.916
RWR2, cut-off 4   0.722     0.911
RWR2, cut-off 5   0.721     0.908
RWR2, cut-off ∞   0.720     N/A

(b)
RWR2 cut-off      |Σ| = 5   |Σ| = 10
1                 28.8%     46.3%
2                 7.6%      7.3%
3                 3.2%      1.2%
4                 1.3%      0.0%
5                 0.7%      0.0%
∞                 24.6%     N/A

(a) Average language size for RWR and RWR2 for various values of the rank cut-off
parameter. A cut-off of ∞ denotes an exhaustive exploration of all possible repairs.
(b) Percentage of target expressions for which RWR outperforms RWR2.
|
|||
|
|
|
|||
|
|
For expressions of alphabet size 5, we were able to consider all possible repairs,
resulting in the entry for a cut-off of ∞ in Table III(a). This represents the smallest
|
|||
|
|
language that includes the SOA’s language and that can be expressed by a
|
|||
|
|
SORE.
|
|||
|
|
Of course, the results in Table III(a) are averaged over 1000 randomly chosen
SOAs. A more detailed analysis reveals that for a considerable number of SOAs,
RWR actually outperforms RWR2 for a cut-off of 1. Table III(b) shows how often
RWR outperforms RWR2 for various values of the cut-off. The probability that RWR
outperforms RWR2 drops rapidly for increasing values of the cut-off, especially for larger
alphabet sizes. The last line in Table III(b) lists the probability that RWR derives
the optimal result, that is, that the smallest language representable by a SORE
is obtained for expressions of alphabet size 5.
|
|||
|
|
Although the RWR2 algorithm clearly outperforms RWR in terms of the language size of the derived expression, there is a compelling argument in the
|
|||
|
|
latter’s favor. In terms of running time, RWR outperforms RWR2 by a few orders of magnitude, as is discussed in Section 7.5.
|
|||
|
|
7.3 Incomplete Data
|
|||
|
|
Unfortunately, in a real-world setting an available sample may simply contain
|
|||
|
|
too little information to learn the target regular expression. To formalize this,
|
|||
|
|
we introduce the notion of coverage.
|
|||
|
|
Definition 36. A sample S covers a deterministic automaton A if for every
|
|||
|
|
edge (s, t) in A there is a word w ∈ S whose unique accepting run in A traverses (s, t). Such a word w is called a witness for (s, t). A sample S covers a
|
|||
|
|
deterministic regular expression r if it covers the automaton obtained from r
|
|||
|
|
using the Glushkov construction for translating regular expressions into automata [Brüggeman-Klein 1993].
|
|||
|
|
If a sample S does not contain a witness for an edge, it may seem as if
|
|||
|
|
the target expression cannot be learned, even if it is a SORE since the SOA
|
|||
|
|
derived from the data has an edge missing. However, the repair rules introduce
|
|||
|
|
extra edges, so this part of the algorithm may actually alleviate the problem of
|
|||
|
|
Table IV. Percentage of Successfully Derived Expressions at Various Values of Sample
Coverage for CRX, RWR0, RWR and RWR2 (cut-off 1)

coverage   CRX    RWR0   RWR    RWR2 (cut-off 1)
25.0       85%    56%    12%    73%
35.0       87%    48%    32%    73%
45.0       96%    60%    57%    74%
55.0       87%    58%    63%    57%
65.0       82%    48%    58%    59%
75.0       80%    51%    51%    63%
85.0       63%    48%    47%    53%
92.5       57%    48%    47%    61%
97.5       85%    74%    64%    73%
100.0      100%   100%   100%   100%
|
|||
|
|
|
|||
|
|
incomplete data. This is indeed confirmed experimentally. It turns out that even
|
|||
|
|
with a substantial fraction of missing witnesses, the target regular expression
|
|||
|
|
can be learned with an astonishing degree of success. To quantify the missing
|
|||
|
|
information, we introduce the following definition:
|
|||
|
|
Definition 37. The coverage of a sample with respect to a target expression
|
|||
|
|
r is the ratio of the number of edges of the SOA derived from the sample and
|
|||
|
|
the SOA representing the target expression r.
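As a small worked instance of Definition 37, the sketch below computes coverage from edge sets. It assumes that the SOA of a sample has one state per symbol plus a source and a sink, with an edge for every pair of adjacent positions in a word; the state names and function names are illustrative, and the target SOA is obtained here from a hand-picked representative sample rather than from the expression itself.

```python
SRC, SNK = "<src>", "<snk>"  # hypothetical names for the source and sink states


def soa_edges(words):
    """Edges of the automaton induced by a sample: source, 2-grams, sink."""
    edges = set()
    for w in words:
        path = [SRC, *w, SNK]
        edges.update(zip(path, path[1:]))
    return edges


def coverage(sample, target_edges):
    """Ratio of sample-SOA edges to target-SOA edges (sample assumed valid)."""
    return len(soa_edges(sample)) / len(target_edges)


if __name__ == "__main__":
    # Target a b? c: the words abc and ac together exercise all of its edges.
    target = soa_edges(["abc", "ac"])
    print(len(target), coverage(["abc"], target))
```

For the target a b? c, the single word abc misses only the edge from a to c, so four of the five edges are present.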
|
|||
|
|
The tests were done on 100 real-world regular expressions of alphabet sizes
|
|||
|
|
up to 10, for 10 independently selected samples of varying coverage. The results are presented in Table IV. The straightforward CRX clearly outperforms all
|
|||
|
|
other algorithms, although this result should be approached with some caution:
|
|||
|
|
to give CRX a fair chance, the target expressions for this algorithm were limited
|
|||
|
|
to CHAREs, while the other algorithms were tested on general SOREs as well.
|
|||
|
|
Note that approximately 90% of real-world expressions are in fact CHAREs,
|
|||
|
|
hence its superior performance is not only due to simpler target expressions.
|
|||
|
|
The robustness of RWR2 with cut-off 1 is quite remarkable since it tends to derive more specific
regular expressions than RWR0 and RWR. One would expect the generalization
ability to decrease for algorithms that yield more specific results. This expectation is borne out when one compares RWR0 and RWR; however, the greedy
application of the repair rules by RWR2 with cut-off 1 seems to pay off in the context of incomplete data
as well.
|
|||
|
|
7.4 Noise
|
|||
|
|
As already noted in the Introduction, real-world samples (such as XHTML)
need not be valid with respect to their known schema. Errors crop up due to
|
|||
|
|
all sorts of circumstances. This underscores the need for a robust inference
|
|||
|
|
algorithm that can handle some noise in the input sample.
|
|||
|
|
Noise can come in several forms. To generate a noisy subsample, we modify
|
|||
|
|
the target expression either by replacing a symbol by a different one from the
target expression, or by replacing it by a symbol that is not in the alphabet of
the target expression. We then use the modified target expression to generate
|
|||
|
|
a complete sample. We define the noise level as follows.
|
|||
|
|
Definition 38. Given a target expression r, the noise level of a sample S is
|
|||
|
|
the ratio |S− L(r)|/|S|.
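A minimal sketch of Definition 38, assuming single-character symbols so that the target expression can be written as a pattern for Python's `re` module; the function name is illustrative.

```python
import re


def noise_level(sample, target):
    """Fraction of words in the sample that the target expression rejects."""
    pattern = re.compile(target)
    invalid = [w for w in sample if pattern.fullmatch(w) is None]
    return len(invalid) / len(sample)


if __name__ == "__main__":
    # Target a b? c; the words "abd" and "a" are noise.
    print(noise_level({"abc", "ac", "abd", "a"}, "ab?c"))
```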
|
|||
|
|
Here we propose an approach to filter the sample S based on the probability
|
|||
|
|
of its words being generated by a probabilistic automaton, as we already used
|
|||
|
|
in previous work [Bex et al. 2008]. This probabilistic automaton has one state
|
|||
|
|
for each alphabet symbol, and the transition probabilities are computed using
|
|||
|
|
the Baum-Welch algorithm [Rabiner 1989]. Given the probabilistic automaton,
|
|||
|
|
it is straightforward to compute the probability for each w ∈ S, so that one can
|
|||
|
|
rank the sample’s words. One expects words that contain noise, that is, that
|
|||
|
|
would be rejected by the target regular expression, to have low probability if
|
|||
|
|
their number is not excessively large compared to the sample’s size.
|
|||
|
|
To filter the sample, hoping to exclude those words that contain noise, we
|
|||
|
|
compute the mean μ and standard deviation σ of the sample’s probabilities. A
|
|||
|
|
string w ∈ S with probability P(w) is excluded if P(w) < μ − ασ . The factor α
|
|||
|
|
is a parameter of the algorithm. The filtered sample S′ is now used to derive
a regular expression. It is of course possible that in the generation of S′ some
|
|||
|
|
words needed to derive the target expression were removed. Hence there is no
|
|||
|
|
guarantee that the derived regular expression will be an overapproximation of
|
|||
|
|
the target expression.
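The filtering rule P(w) < μ − ασ can be sketched as follows. The word probabilities are assumed to be given (in the text they come from the trained probabilistic automaton, which is not reproduced here), the use of the population standard deviation is an assumption since the text does not say which estimator is meant, and the function name is illustrative.

```python
from statistics import mean, pstdev


def filter_sample(prob, alpha):
    """Keep the words whose probability is at least mu - alpha * sigma.

    `prob` maps each word of the sample to its (precomputed) probability.
    """
    values = list(prob.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    threshold = mu - alpha * sigma
    return {w for w, p in prob.items() if p >= threshold}


if __name__ == "__main__":
    # Made-up probabilities for illustration only.
    prob = {"ab": 0.40, "abb": 0.35, "ac": 0.30, "xx": 0.01}
    print(sorted(filter_sample(prob, 0.7)))
```

A larger α lowers the threshold and therefore removes fewer words; with these illustrative numbers, α = 0.7 drops only the low-probability word "xx".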
|
|||
|
|
Since it was shown in previous sections that RWR21 has the best overall performance, we focus solely on this algorithm in this section. In order to investigate
|
|||
|
|
how robust RWR21 is with respect to noise we applied the algorithm to samples S
|
|||
|
|
with increasing noise levels with a range of values for the cut-off α. We compute
|
|||
|
|
the precision and the recall for each individual expression and use the average
|
|||
|
|
values over many expressions to compute the F-value for a given noise level
|
|||
|
|
and cut-off so that the optimal cut-off point can be determined.
|
|||
|
|
To define precision and recall, consider the sample S = Svalid ∪ Sinvalid , where
|
|||
|
|
Svalid ⊆ S contains the words in S accepted by the target expression and Sinvalid
|
|||
|
|
contains the words in S not accepted by the target expression. A true positive is
|
|||
|
|
a word in Svalid that is accepted by the derived expression, while a false negative
|
|||
|
|
is a word in Svalid that is rejected by the derived expression. Similarly, a false
|
|||
|
|
positive is a word in Sinvalid that is accepted by the derived expression, while a
|
|||
|
|
true negative is a word in Sinvalid that is rejected by the derived expression. We
|
|||
|
|
denote by St.p. the set of true positives, by St.n. the set of true negatives, by Sf .p.
|
|||
|
|
the set of false positives, and by Sf .n. the set of false negatives.
|
|||
|
|
Definition 39. The precision p, recall r, and F-value of a derived regular
expression on a sample S are given by
\[
p = \frac{|S_{t.p.}|}{|S_{t.p.}| + |S_{f.p.}|}, \qquad
r = \frac{|S_{t.p.}|}{|S_{t.p.}| + |S_{f.n.}|}, \qquad
F = \frac{2pr}{p + r}.
\]
|
|||
|
|
|
|||
|
|
Furthermore, we are interested in the fraction of derived regular expressions
|
|||
|
|
that is equivalent to the target expression.
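The quantities of Definition 39 can be computed directly from a sample once the target and derived expressions are available as patterns. The sketch below assumes single-character symbols and Python `re` syntax; the function name and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the definition.

```python
import re


def evaluate(sample, target, derived):
    """Return (precision, recall, F) of `derived` on `sample` w.r.t. `target`."""
    t, d = re.compile(target), re.compile(derived)
    tp = fp = fn = 0
    for w in sample:
        valid = t.fullmatch(w) is not None      # w is in S_valid
        accepted = d.fullmatch(w) is not None   # derived expression accepts w
        if valid and accepted:
            tp += 1
        elif valid:
            fn += 1
        elif accepted:
            fp += 1
        # otherwise w is a true negative, which none of the three measures uses
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    # Target a b? c, derived a b? c?; "a" is a false positive, "abd" a true negative.
    print(evaluate({"abc", "ac", "a", "abd"}, "ab?c", "ab?c?"))
```

In this instance the derived expression accepts every valid word, so recall is 1, while the accepted noisy word "a" lowers precision to 2/3.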
|
|||
|
|
We average over 580 SOREs obtained from a corpus of real-world DTDs.
|
|||
|
|
The results are shown in Figure 16(a). From the F-value we can conclude
|
|||
|
|
that a cut-off value α_F ≈ 0.7 yields the best balance between precision and
|
|||
|
|
recall. Figure 16(b) shows the fraction of derived regular expressions that is
|
|||
|
|
Fig. 16. (a) F-value as a function of the cut-off value α for noise levels of 0.01 (squares), 0.02
|
|||
|
|
(circles), and 0.05 (triangles). (b) Fraction of derived expressions equivalent to the target expression
|
|||
|
|
as a function of the cut-off value α for noise levels of 0.01 (squares), 0.02 (circles), and 0.05
|
|||
|
|
(triangles).
|
|||
|
|
|
|||
|
|
equivalent to the target expression. For noise levels increasing from 0.01 to
|
|||
|
|
0.05, the F-value as well as the percentage of derived expressions equivalent
|
|||
|
|
to the target expression gradually decreases, as is to be expected. It should be
|
|||
|
|
noted that recall r < 1 implies that the language represented by the derived
|
|||
|
|
regular expression is not a superset of the target’s language. For the cut-off α_F
|
|||
|
|
and a noise level of 0.01, approximately 16% of the derived regular expressions
|
|||
|
|
allow false negatives, while the value for a noise level of 0.05 is 15%. The fact
|
|||
|
|
that the derived expression is not a super-approximation may or may not be
|
|||
|
|
acceptable, depending on the application.
|
|||
|
|
Another interesting observation is that the number of derived expressions
that are equivalent to the target expression increases beyond the cut-off value
α_F; see Figure 16(b). For a noise level of 0.01, this trend continues up to
cut-off values of α_equiv ≈ 0.3 where it reaches a maximum of approximately
53%. However, at this value 20% of the derived regular expressions are not
super-approximations to their target expressions. For α < α_equiv, the F-value
decreases rapidly. For higher noise levels, the optimal cut-off value α_equiv is
smaller, but since it is very unlikely that one knows the noise level, it is hard
to take advantage of this fact by tuning α_equiv to a specific noise level. The
overall best result will be obtained for α_equiv ≈ 0 for noise levels not exceeding
0.05.

It should be noted that for a noise level of 0.01 at α_equiv, out of the 53% of derived
regular expressions that are equivalent to the target expression, about 7% are
not covered by the sample. The latter illustrates once more the generalization
ability of the algorithm RWR2, as was discussed in Section 7.3.
|
|||
|
|
7.5 Performance
|
|||
|
|
As mentioned previously, the one advantage RWR has over RWR2 is that the
|
|||
|
|
former’s running time is much lower than the latter’s. This is illustrated in
|
|||
|
|
Table V(a) for 1000 target expressions of alphabet size 10. It also shows the
|
|||
|
|
relative running time for RWR0 , illustrating that RWR outperforms both RWR0 and
|
|||
|
|
Table V.

(a)
Algorithm         relative running time
RWR0              6 · 10²
RWR2, cut-off 1   2 · 10²
RWR2, cut-off 2   2 · 10³
RWR2, cut-off 3   1 · 10⁴
RWR2, cut-off 4   4 · 10⁴
RWR2, cut-off 5   1 · 10⁵

(b)
|Σ|    time (ms)
5      2
10     5
15     15
20     33
50     616
100    7562

(a) Relative running times of RWR2 versus RWR for various values of the rank
cut-off parameter. (b) Average running times in milliseconds for RWR
as a function of alphabet size.
|
|||
|
|
|
|||
|
|
RWR2 for any value of the cut-off. However, it is interesting to note that RWR2 with cut-off 1 outperforms
RWR0 by a factor of 3, and derives more specific regular expressions, again
illustrating the superiority of the new algorithms over RWR0.
|
|||
|
|
|
|||
|
|
The performance of RWR is excellent: on average it takes only 5 ms to derive
|
|||
|
|
an expression of alphabet size 10. Table V(b) shows actual running times as a
|
|||
|
|
function of the target expressions’ alphabet size, averaged over 1000 random
|
|||
|
|
expressions of that alphabet size.
|
|||
|
|
With respect to the performance in terms of the number of examples, we
showed in the conference version of this article that the performance of RWR0 was adequate to
|
|||
|
|
deal with large datasets. Example4 with 61 symbols in Table II is derived from
|
|||
|
|
10000 example words in 7 seconds while CRX only needs 3.2 seconds. More
|
|||
|
|
typical expressions of about 10 symbols derived from a few hundred examples
|
|||
|
|
take approximately a second. These figures include the time to initialize a
|
|||
|
|
Java Virtual Machine while the tests are done on a 2.5 GHz P4 with 1GB
|
|||
|
|
of RAM. Given that RWR and RWR21 outperform RWR0 and the time required to
|
|||
|
|
start the virtual machine and parse the data is independent of the algorithm,
|
|||
|
|
our new algorithms are adequate as well. For instance, RWR derived a DTD
|
|||
|
|
for PubMed from 10000 articles with a total size of over 1.2GB in 264 seconds
|
|||
|
|
(again including the time needed for Java initialization and parsing of the XML
|
|||
|
|
data). Trang slightly outperforms CRX thanks to very efficient XML parsing. We
|
|||
|
|
did not make a detailed comparison with XTRACT for the reason that XTRACT
|
|||
|
|
cannot handle samples with more than 1000 words.
|
|||
|
|
8. EXTENSIONS

Incremental computation. Especially in the setting of sparse data, when more XML data
gets generated over time (for instance, by answers to queries or results of calls to Web
services), it is desirable to update an already generated schema based on the newly
arrived XML data only. Such an approach is possible for both RWR and CRX: as both
algorithms make use of an internal representation (automata or partial orders), we only
need to update that representation. So, for every element name we store the corresponding
internal graph representation, which is only quadratic in the number of different element
names, and we can forget about the XML data that generated it. Actually, for CRX, to
assign the qualifiers ?, +, and ∗, we also need to remember for each element name how
it occurs (always exactly once, always more than once, . . . ), but this is only a
constant amount of information.
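
The incremental scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of RWR or CRX: the class name `ContentModelSummary`, the `START`/`END` markers, and the per-name qualifier rule are assumptions made for the example. In particular, CRX assigns qualifiers to whole factors of a chain expression, whereas the sketch tracks individual element names only.

```python
from collections import Counter

# Hypothetical markers for the beginning and end of a child sequence.
START, END = "<start>", "<end>"


class ContentModelSummary:
    """Hypothetical per-element-name state kept instead of the raw XML data."""

    def __init__(self):
        self.edges = set()               # (a, b): b directly followed a in some sequence
        self.names = set()               # child element names seen so far
        self.sometimes_absent = set()    # names missing from at least one sequence
        self.sometimes_repeated = set()  # names occurring more than once in some sequence
        self.sequences_seen = 0

    def update(self, children):
        """Fold one child-name sequence into the summary; it can then be discarded."""
        path = [START, *children, END]
        self.edges.update(zip(path, path[1:]))

        counts = Counter(children)
        if self.sequences_seen > 0:
            # A name first seen now was absent from every earlier sequence.
            self.sometimes_absent.update(counts.keys() - self.names)
        self.names.update(counts)
        for name in self.names:
            if counts[name] == 0:
                self.sometimes_absent.add(name)
            elif counts[name] > 1:
                self.sometimes_repeated.add(name)
        self.sequences_seen += 1

    def qualifier(self, name):
        """Simplified per-name qualifier: '', '?', '+' or '*'."""
        absent = name in self.sometimes_absent
        repeated = name in self.sometimes_repeated
        if absent and repeated:
            return "*"
        if absent:
            return "?"
        return "+" if repeated else ""
```

The edge set has at most quadratically many entries in the number of element names, and the occurrence information amounts to two flags per name, which matches the storage bounds stated above. Processing the data in several batches yields the same summary as processing it all at once.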

Numerical predicates. An immediate drawback of SOREs is that they cannot count. For
instance, they cannot express aabb+, which specifies that a string should start with two
a’s followed by at least two b’s. XML Schema even uses dedicated attributes for
expressing the desired number of repetitions.

   <xs:sequence>
     <xs:element name="a" minOccurs="2" maxOccurs="2"/>
     <xs:element name="b" minOccurs="2" maxOccurs="unbounded"/>
   </xs:sequence>

In the same way, REs can be extended by numerical predicates: when $r$ is an RE and $i$
is a natural number, then $r^{\geq i}$ and $r^{=i}$ are also REs. They are semantically
equivalent to $r^i r^*$ and $r^i$, respectively, where $r^i = r \cdot r \cdots r$ ($i$
times). The preceding expression can then be expressed as $a^{=2} b^{\geq 2}$. To both RWR
and CRX a post-processing step can be added that rewrites + and ∗ to numerical
values based on exact occurrences of element names in the XML data.
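
One way such a post-processing step might look is sketched below. The function names are hypothetical, and the rule used (emit $r^{=i}$ when every sample shows exactly $i$ occurrences, otherwise $r^{\geq i}$ for the smallest observed count $i$) is an assumption for illustration, not the rule of the article. The sketch further assumes single-character element names that occur in one fixed order, and it renders the predicates with the counted quantifiers of Python's `re` syntax.

```python
import re


def counted_quantifier(counts):
    """Summarise observed occurrence counts as a counted regex quantifier."""
    lo, hi = min(counts), max(counts)
    if lo == hi:
        return f"{{{lo}}}"   # corresponds to r^{=i}
    return f"{{{lo},}}"      # corresponds to r^{>=i}, generalising beyond the largest count


def infer_counted_expression(samples, names):
    """Build a counted expression for names that appear in the given fixed order."""
    parts = []
    for name in names:
        counts = [sample.count(name) for sample in samples]
        parts.append(re.escape(name) + counted_quantifier(counts))
    return "".join(parts)


expression = infer_counted_expression(["aabb", "aabbb", "aabbbbb"], "ab")
print(expression)
```

For these samples the result is `a{2}b{2,}`, the `re` rendering of $a^{=2} b^{\geq 2}$, which accepts the same strings as aabb+.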

Generation of XSDs. While the inference of DTDs essentially reduces to the inference of
regular expressions from sets of sample words (as illustrated in Section 1.1), the
inference of XSDs is much more complex.

Indeed, first and foremost, the content model of an element can only depend on the
element’s name in a DTD. XML Schema, in contrast, has a typing mechanism that allows
the content model of an element to depend not only on its name, but also on the context
in which it is used. We refer the interested reader to Martens et al. [2006, 2007] for an
in-depth discussion of the XML Schema typing mechanism and the extra expressive power
that it provides with respect to DTDs. It is important to note, however, that the study
of Martens et al. [2006] also shows that 85% of XSDs in practice do not use this
additional power, and are hence structurally equivalent to a DTD. Obviously, inferring
such XSDs is merely a matter of using the correct syntax. How to extend schema inference
to deal with real XSDs that do use the additional power of the XML Schema typing system
is studied in a companion article [Bex et al. 2007].
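
For XSDs that are structurally equivalent to a DTD, "using the correct syntax" could look roughly like the following sketch, which turns a chain-like expression into an `xs:sequence`. The input encoding (a list of factors, each a list of alternative names plus a qualifier) and the function name are assumptions made for this example; only the `minOccurs`/`maxOccurs` attributes and the `xs:sequence`, `xs:choice`, and `xs:element` constructs come from XML Schema itself.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Occurrence bounds corresponding to the regular expression qualifiers.
OCCURS = {
    "": ("1", "1"),
    "?": ("0", "1"),
    "+": ("1", "unbounded"),
    "*": ("0", "unbounded"),
}


def chare_to_xsd_sequence(factors):
    """Translate [(alternative_names, qualifier), ...] into an xs:sequence element."""
    sequence = ET.Element(f"{{{XS}}}sequence")
    for names, qualifier in factors:
        low, high = OCCURS[qualifier]
        attrs = {"minOccurs": low, "maxOccurs": high}
        parent = sequence
        if len(names) > 1:
            # A disjunction becomes an xs:choice that carries the occurrence bounds.
            parent = ET.SubElement(sequence, f"{{{XS}}}choice", attrs)
            attrs = {}
        for name in names:
            ET.SubElement(parent, f"{{{XS}}}element", {"name": name, **attrs})
    return sequence


# a (b + c)* d?  in the notation of the article
model = chare_to_xsd_sequence([(["a"], ""), (["b", "c"], "*"), (["d"], "?")])
print(ET.tostring(model, encoding="unicode"))
```

The sketch omits element types and the enclosing `xs:complexType`, so its output is a content-model fragment rather than a complete schema.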

Second, DTDs have essentially only one atomic data type to describe the textual data
found in XML documents: #PCDATA. XML Schema, in contrast, has atomic data types for
numbers, strings, dates, etc. The algorithms described here can easily be extended with
heuristics to recognize these atomic data types, such as the ones described by
Hegewald et al. [2006].
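
A heuristic of this kind might be sketched as follows. This is an illustrative assumption, not the method of Hegewald et al. [2006]: the function tries a few XML Schema built-in types from most to least specific and falls back to `xs:string`. The patterns are deliberately simplified; for instance, time zone suffixes on dates are not handled.

```python
import re
from datetime import date

INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_date(text):
    """Accept YYYY-MM-DD strings that denote an existing calendar date."""
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def guess_xsd_type(values):
    """Return the most specific candidate type that accepts every text value."""
    values = [value.strip() for value in values]
    if not values:
        return "xs:string"
    checks = [
        ("xs:integer", lambda v: INTEGER.fullmatch(v) is not None),
        ("xs:decimal", lambda v: DECIMAL.fullmatch(v) is not None),
        ("xs:date", is_date),
    ]
    for type_name, accepts in checks:
        if all(accepts(v) for v in values):
            return type_name
    return "xs:string"
```

Because every value must pass the same check, a single non-conforming value makes the guess fall back to a more general type, which keeps the inferred schema valid for all observed data.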

Inference of k-OREs. As the vast majority of expressions used in practical schemas are
SOREs, we focused in this article on the inference of SOREs. In a companion article
[Bex et al. 2008] we study the derivation of k-OREs, for small values of k, thus covering
virtually all expressions occurring in practice.
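
To make the parameter k concrete, the sketch below computes the smallest k for which a content model is a k-ORE. It assumes the usual definition, in which every element name occurs at most k times in the expression, so that a SORE is a 1-ORE. The function name and the DTD-style input syntax are assumptions for this example, and keywords such as #PCDATA or EMPTY are not treated specially.

```python
import re
from collections import Counter

# Element names; operators, parentheses, commas and whitespace act as separators.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def occurrence_bound(content_model):
    """Smallest k such that no element name occurs more than k times."""
    names = NAME.findall(content_model)
    return max(Counter(names).values(), default=0)


print(occurrence_bound("(title, author+, (year | date)?)"))
print(occurrence_bound("(a, (a | b)*)"))
```

The first expression mentions every name once and is therefore a SORE; the second mentions `a` twice, so it is a 2-ORE but not a SORE.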

9. CONCLUSION

We introduced novel algorithms for the inference of concise regular expressions from
positive data. For the inference of SOREs, RWR2 was shown to yield the best experimental
results. It is also quite robust when presented with incomplete and noisy data. On
real-world and synthetic datasets, the quality of the inferred expressions exceeds that
of the expressions returned by XTRACT, while CRX is similar to Trang. CRX’s
generalization ability makes it well suited to dealing with very small datasets. Further,
RWR, RWR2, and CRX by definition always infer succinct expressions, which can easily be
interpreted by humans. Of independent interest, we introduced a new algorithm to
transform automata into short, readable regular expressions.

ELECTRONIC APPENDIX

The electronic appendix for this article can be accessed in the ACM Digital Library.

ACKNOWLEDGMENTS

We thank the authors of Garofalakis et al. [2003] for making available XTRACT’s source
code, as well as Wouter Gelade for comments on a previous draft of this article.

REFERENCES

ABITEBOUL, S., BUNEMAN, P., AND SUCIU, D. 1999. Data on the Web. Morgan Kaufmann Publishers.
AHONEN, H. 1996. Generating grammars for structured documents using grammatical inference methods. Ph.D. thesis, Report A-1996-4. Department of Computer Science, University of Helsinki.
ANGLUIN, D. AND SMITH, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237–269.
BARBOSA, D., MENDELZON, A. O., KEENLEYSIDE, J., AND LYONS, K. A. 2002. ToXgene: An extensible template-based data generator for XML. In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002). 49–54.
BARBOSA, D., MIGNET, L., AND VELTRI, P. 2006. Studying the XML web: Gathering statistics from an XML sample. World Wide Web 9, 2, 187–212.
BENEDIKT, M., FAN, W., AND GEERTS, F. 2008. XPath satisfiability in the presence of DTDs. J. ACM 55, 2, 1–79.
BERNSTEIN, P. A. 2003. Applying model management to classical meta data problems. In Online Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR’03).
BEX, G. J., GELADE, W., NEVEN, F., AND VANSUMMEREN, S. Learning deterministic regular expressions for the inference of schemas from XML data. http://arxiv.org/abs/1004.2372.
BEX, G. J., GELADE, W., NEVEN, F., AND VANSUMMEREN, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceedings of the 17th International Conference on World Wide Web (WWW’08). 825–834.
BEX, G. J., NEVEN, F., AND VAN DEN BUSSCHE, J. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the International Workshop on Web and Database (WebDB). S. Amer-Yahia and L. Gravano, Eds. 79–84.
BEX, G. J., NEVEN, F., SCHWENTICK, T., AND TUYLS, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the International Conference on Very Large Data Bases (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115–126.
BEX, G. J., NEVEN, F., AND VANSUMMEREN, S. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07). 998–1009.
BRĀZMA, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT’93). ACM Press, 236–242.
BRÜGGEMANN-KLEIN, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197–213.
BRÜGGEMANN-KLEIN, A. AND WOOD, D. 1998. One-Unambiguous regular languages. Inform. Comput. 140, 2, 229–253.
BUNEMAN, P., DAVIDSON, S. B., FERNANDEZ, M. F., AND SUCIU, D. 1997. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT’97). Lecture Notes in Computer Science, vol. 1186. Springer, 336–350.
CARON, P. AND ZIADI, D. 2000. Characterization of Glushkov automata. Theor. Comput. Sci. 233, 1–2, 75–90.
Castor. The Castor project. www.castor.org.
CHIDLOVSKII, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB’01). CEUR Workshop Proceedings, vol. 45.
CLARK, J. Trang: Multi-Format schema converter based on RELAX NG. www.thaiopensource.com/relaxng/trang.html.
COVER, R. 2003. The Cover Pages. xml.coverpages.org.
DELGADO, M. AND MORAIS, J. 2004. Approximation to the smallest regular expression for a given regular language. In Proceedings of the 9th International Conference on Implementation and Application of Automata. Lecture Notes in Computer Science, vol. 3317. Springer, 312–314.
DEUTSCH, A., FERNANDEZ, M. F., AND SUCIU, D. 1999. Storing semistructured data with STORED. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 431–442.
EHRENFEUCHT, A. AND ZEIGER, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134–146.
FERNANDEZ, M. F. AND SUCIU, D. 1998. Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 14–23.
FERNAU, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications. Lecture Notes in Artificial Intelligence, vol. 3264. Springer, 277–278.
FERNAU, H. 2009. Algorithms for learning regular expressions from positive data. Inform. Comput. 207, 4, 521–541.
FLORESCU, D. 2005. Managing semi-structured data. ACM Queue 3, 8, 18–24.
GARCÍA, P. AND VIDAL, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920–925.
GAROFALAKIS, M., GIONIS, A., RASTOGI, R., SESHADRI, S., AND SHIM, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Mining Knowl. Discov. 7, 23–56.
GELADE, W. AND NEVEN, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS’08). Dagstuhl Seminar Proceedings, vol. 08001. 325–336.
GOLD, E. 1967. Language identification in the limit. Inform. Control 10, 5, 447–474.
GOLDMAN, R. AND WIDOM, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97). 436–445.
GRUBER, H. AND HOLZER, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5126. Springer, 39–50.
HAN, Y.-S. AND WOOD, D. 2007. Obtaining shorter regular expressions from finite-state automata. Theor. Comput. Sci. 370, 1–3, 110–120.
HEGEWALD, J., NAUMANN, F., AND WEIS, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06). IEEE Computer Society, 81–97.
HINKELMAN, S. 2005. Business integration—Information conformance statements (BI-ICS). Tech. rep., IBM DeveloperWorks.
HOPCROFT, J. AND ULLMAN, J. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley.
HUET, G. 1980. Confluent reductions: Abstract properties and applications to term rewriting systems. J. ACM 27, 4, 797–821.
KOCH, C., SCHERZINGER, S., SCHWEIKARDT, N., AND STEGMAIER, B. 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB’04). 228–239.
MANOLESCU, I., FLORESCU, D., AND KOSSMANN, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01). 241–250.
MARTENS, W., NEVEN, F., AND SCHWENTICK, T. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15–22.
MARTENS, W., NEVEN, F., SCHWENTICK, T., AND BEX, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Data. Syst. 31, 3.
MCHUGH, J., ABITEBOUL, S., GOLDMAN, R., QUASS, D., AND WIDOM, J. 1997. Lore: A database management system for semistructured data. SIGMOD Rec. 26, 3, 54–66.
MELNIK, S. 2004. Generic model management: Concepts and algorithms. Ph.D. thesis, University of Leipzig.
MIGNET, L., BARBOSA, D., AND VELTRI, P. 2003. The XML web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500–510.
MIKLAU, G. 2002. XMLData repository. www.cs.washington.edu/research/xmldatasets.
MIN, J.-K., AHN, J.-Y., AND CHUNG, C.-W. 2003. Efficient extraction of schemas for XML documents. Inform. Process. Lett. 85, 1, 7–12.
NESTOROV, S., ABITEBOUL, S., AND MOTWANI, R. 1998. Extracting schema from semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 295–306.
NESTOROV, S., ULLMAN, J. D., WIENER, J. L., AND CHAWATHE, S. S. 1997. Representative objects: Concise representations of semistructured, hierarchial data. In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79–90.
NEVEN, F. AND SCHWENTICK, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logical Methods Comput. Sci. 2, 3.
NGU, A. H. H., ROCCO, D., CRITCHLOW, T., AND BUTTLER, D. 2005. Automatic discovery and inferencing of complex bioinformatics web interfaces. World Wide Web 8, 4, 463–493.
OAKS, P. AND TER HOFSTEDE, A. H. M. 2007. Guided interaction: A mechanism to enable ad hoc service interaction. Inform. Syst. Frontiers 9, 1, 29–51.
OHLEBUSCH, E. 2001. Implementing conditional term rewriting by graph rewriting. Theor. Comput. Sci. 262, 1, 311–331.
OPEN WEB APPLICATION SECURITY PROJECT CONSORTIUM. 2004. The top ten most critical web application security vulnerabilities—2004 update. www.owasp.org.
PITT, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference (AII’89). Springer-Verlag, 18–44.
RABINER, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257–286.
RAHM, E. AND BERNSTEIN, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334–350.
SAHUGUET, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases (WebDB’00), Selected Papers. 171–183.
SAKAKIBARA, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15–45.
SANKEY, J. AND WONG, R. K. 2001. Structural inference for semistructured data. In Proceedings of the International Conference on Information and Knowledge Management. ACM Press, 159–166.
Sun. Sun JAXB. java.sun.com/webservices/jaxb.
THOMPSON, H. S., BEECH, D., MALONEY, M., AND MENDELSOHN, N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium, Recommendation REC-xmlschema-1-20041028.
W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language, 2nd Ed. W3C.
WANG, G., LIU, M., YU, J. X., SUN, B., YU, G., LV, J., AND LU, H. 2003. Effective schema-based XML query optimization techniques. In Proceedings of the 7th International Database Engineering and Applications Symposium. 230–235.

Received January 2009; revised July 2009; accepted November 2009

ACM Transactions on Database Systems, Vol. 35, No. 2, Article 11, Publication date: April 2010.