
RELAX NG and W3C XML Schema



I just had a look at draft-hollenbeck-ietf-xml-guidelines-04.  Section
4.6 says "XML Schema should be used as the formalism in the absence of
clearly stated reasons to choose another."  I strongly disagree with
this recommendation.

I believe RELAX NG is preferable in many situations to XML Schema and
should receive at least equal billing.  Concretely, I propose in the
sentence above changing "XML Schema" to "XML Schema or RELAX NG".

Currently, 4.6 mentions RELAX NG in the following terms: "There are
also a number of other mechanisms for describing XML instance
validity; these include, for example, Schematron [48], RELAX NG [47],
and the Document Schema Definition Language [34]."  Firstly, this
mentions RELAX NG and DSDL as if they were separate things.  This is
incorrect.  RELAX NG is in fact Part 2 of DSDL (which now stands for
Document Schema Definition Language*s*).  I don't think RELAX NG is
just another mechanism.  It is a solid, mature and stable
specification.  It has been developed in an open standards process (in
OASIS).  It has multiple, independent and interoperable
implementations.  It is based on a solid body of CS theory (tree
automata). It is on track to become a fully-fledged International
Standard: it recently went out as a Draft International Standard [1].

Certainly no one can deny that at this point W3C XML Schema enjoys
much greater acceptance in the marketplace.  However, I would argue
this should not be the key criterion used to select which schema
languages to recommend for use in IETF specifications.  I believe the
key function of a schema language in a specification of an XML
application is to communicate unambiguously and precisely to a human
reader what XML documents are legal for that application; it serves a
similar role for XML that ABNF does for text.  Thus, the
key criterion should be how well the schema language performs this
function.

On this criterion, there are many reasons to prefer RELAX NG.

1. RELAX NG was designed to be simple and easy to understand.  RELAX
NG is simple enough that without even reading the RELAX NG spec,
somebody familiar with XML can read a RELAX NG grammar and understand
what it means.  You can learn to write RELAX NG in 30 minutes by
reading the tutorial [2].  RELAX NG is fairly free of surprises.
Constructs mean what you would guess they mean.

This is not the case with W3C XML Schema.  It requires considerable
expertise to be able to understand a W3C XML Schema correctly.  There
are many cases where you cannot guess what a construct means or where
you might guess wrong.  For example, if you derive a complex type by
restriction you have to specify the new restricted content model
explicitly.  However, attributes are treated in the opposite way: by
default you get all the attributes and you have to explicitly rule out
the ones you do not want.  This may be more convenient but it makes
for schemas that can be easily misunderstood by the uninitiated:
somebody who is not an expert, seeing a restriction with a content
model but no attributes, might well assume that no attributes were
allowed.  This is not an isolated example.
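
To make the restriction point concrete, here is a minimal sketch.  The
type names "base" and "restricted", the child elements, and the "id"
attribute are all made up for illustration.  The restricted type
restates its content model and says nothing about attributes, yet, as
I read the Recommendation, "id" is still allowed on it.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<!-- Hypothetical base type: two child elements and one attribute. -->
<xs:complexType name="base">
  <xs:sequence>
    <xs:element name="title" type="xs:string"/>
    <xs:element name="note" type="xs:string" minOccurs="0"/>
  </xs:sequence>
  <xs:attribute name="id" type="xs:string"/>
</xs:complexType>

<!-- The content model must be restated in full (here dropping the
     optional note), but the id attribute is inherited silently. -->
<xs:complexType name="restricted">
  <xs:complexContent>
    <xs:restriction base="base">
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:restriction>
  </xs:complexContent>
</xs:complexType>

</xs:schema>
```

To actually rule out the attribute, the restriction would need an
explicit <xs:attribute name="id" use="prohibited"/>.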

There are many things about XML Schema that are just plain bizarre.
Here's a random example I ran across yesterday.  Suppose you have two
attribute groups g1 and g2, containing sets of attributes a1 and a2
and attribute wildcards w1 and w2.  Now suppose you have a complex
type t that references g1 and g2.  The effective attributes of t will,
as you would expect, be the union of a1 and a2, but the attribute
wildcards will be the *intersection* of w1 and w2. For example, given

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
  xmlns="http://eg.com"
  targetNamespace="http://eg.com">

<xs:attributeGroup name="g1">
  <xs:attribute name="a1" type="xs:string"/>
  <xs:anyAttribute namespace="http://eg.com/1 http://eg.com/2"
       processContents="skip"/>
</xs:attributeGroup>

<xs:attributeGroup name="g2">
  <xs:attribute name="a2" type="xs:string"/>
  <xs:anyAttribute namespace="http://eg.com/2" processContents="skip"/>
</xs:attributeGroup>
</xs:attributeGroup>

<xs:element name="foo">
  <xs:complexType>
    <xs:attributeGroup ref="g1"/>
    <xs:attributeGroup ref="g2"/>
  </xs:complexType>
</xs:element>

</xs:schema>

the foo element could have an a1 attribute or an a2 attribute or any
attribute from the http://eg.com/2 namespace, but could not have
attributes from the http://eg.com/1 namespace.

Maybe there's some good reason behind this, but I believe this sort of
design decision makes W3C XML Schema a very poor choice as a formalism
for communicating an XML grammar to a human reader.

2. The problem described in 1 above might be tolerable if the W3C XML
Schema Recommendation [3] were easy to understand. However, it is
without doubt the hardest-to-understand specification that I have ever
read.  In order to be able to understand the precise meaning of a
schema in an IETF specification, readers would have to consult the
W3C XML Schema Recommendation.  But it is extraordinarily hard for a
reader to determine from the Recommendation the meaning of some
particular construct they are not sure of.

I often hear people say: "It doesn't really matter that the W3C XML
Schema Rec is so hard to understand; only W3C XML Schema implementors
need to understand it". I think this is misguided.  People who
want to be sure they have understood exactly what a particular W3C XML
schema means also have to understand the W3C XML Schema Rec.

3. The RELAX NG specification includes a normative, formal description
of the semantics of a RELAX NG schema. This was not developed as an
afterthought but was a guide throughout the design of the semantics.
More than a year after the publication of the W3C XML Schema
Recommendation, "XML Schema: Formal Description" [4] is still a work
in progress and is still far from being a complete and correct
description of the semantics of XML Schema; moreover, it cannot be
relied on as it has no normative force.

The RELAX NG formalism has a solid basis in tree automata theory.  W3C
XML Schema has no such basis.

The role of a schema in a specification is to serve as a formalism.
How good is a formalism if that formalism itself lacks a proper formal
basis?

4. W3C XML Schema's support for attributes is totally inadequate and
provides no advance over DTDs.  As with DTDs, W3C XML Schema only
allows the specification of whether attributes are required or
optional.  There is no way to specify more complex constraints between
attributes or between attributes and elements.  There is no way to say
that either attribute X or attribute Y is allowed or that either
attribute X or element Y is allowed.  In my experience, this sort of
constraint is extremely common in XML grammars.

RELAX NG integrates attributes into content models.  Exactly the same
mechanism that is used to constrain the cooccurrence of child elements
can be used to constrain the cooccurrence of attributes and the
cooccurrence of attributes and child elements.
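
As a rough sketch of what this looks like in practice (the element
and attribute names "item", "x", "y", "href" and "link" are invented
for the example), the following RELAX NG pattern requires exactly one
of the attributes x or y, and then either an href attribute or a link
child element:

```xml
<element name="item" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- Either attribute x or attribute y, but not both. -->
  <choice>
    <attribute name="x"/>
    <attribute name="y"/>
  </choice>
  <!-- Either an href attribute or a link child element. -->
  <choice>
    <attribute name="href"/>
    <element name="link">
      <text/>
    </element>
  </choice>
</element>
```

The same "choice" element does the work in both cases; nothing
attribute-specific is needed.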

5. W3C XML Schema provides very weak support for unordered content.
When the designer of an XML vocabulary does not wish to force child
elements to occur in a particular order, it can be impractical to
describe the XML vocabulary using XML Schema, because XML Schema
imposes such limitations on its "all" element as to make it virtually
useless.

RELAX NG provides an "interleave" element, which is restricted
enough to be efficiently implementable but provides adequate support
for designers who wish to allow flexibility in the ordering of
child elements.
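
A minimal sketch of "interleave", using made-up element names
("person", "name", "email", "phone"): the children may appear in any
order, and one of them may repeat.  My understanding is that the
repeating child is one of the things the "all" element in W3C XML
Schema does not permit.

```xml
<element name="person" xmlns="http://relaxng.org/ns/structure/1.0">
  <interleave>
    <!-- Exactly one name, anywhere among the children. -->
    <element name="name">
      <text/>
    </element>
    <!-- Any number of email elements, in any position. -->
    <zeroOrMore>
      <element name="email">
        <text/>
      </element>
    </zeroOrMore>
    <!-- At most one phone element. -->
    <optional>
      <element name="phone">
        <text/>
      </element>
    </optional>
  </interleave>
</element>
```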

6. The approach to handling datatypes in W3C XML Schema is totally
lacking in modularity.  W3C XML Schema is tied to the single
collection of datatypes defined in Part 2 of W3C XML Schema. Yet this
collection of datatypes is a very ad-hoc collection. It includes
datatypes of highly debatable utility (gYearMonth, gDay etc).  Yet it
lacks many datatypes that are important for many applications.

I would argue that no one single collection of datatypes can be
adequate for all applications across the diverse range of domains
supported by XML.  What's needed is a modular approach where a schema
language for specifying structure can be combined with one or more
standard collections of datatypes, some general-purpose and some
domain-specific.  RELAX NG adopts this approach.  You can use the
datatypes defined by W3C XML Schema if you choose, but it is also
possible to use other systems of datatypes instead of or in addition
to these.

With RELAX NG, an IETF specification could define a collection of
datatypes that are useful for IETF applications.  For example, might
it not be useful to have a datatype for an IP address or a domain
name? Such datatypes could be used with RELAX NG with no change to
RELAX NG itself.
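
Here is a sketch of how that might look.  The W3C XML Schema datatype
library URI is the real one; the second library URI and the datatype
names "ipv4Address" and "domainName" are purely hypothetical, standing
in for whatever an IETF datatype library might define.  The element
and attribute names are also invented.

```xml
<element name="server" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- A datatype taken from W3C XML Schema Part 2. -->
  <attribute name="port">
    <data type="unsignedShort"
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"/>
  </attribute>
  <choice>
    <!-- Hypothetical library and type names, for illustration only. -->
    <element name="address">
      <data type="ipv4Address"
            datatypeLibrary="http://example.org/hypothetical-ietf-datatypes"/>
    </element>
    <element name="name">
      <data type="domainName"
            datatypeLibrary="http://example.org/hypothetical-ietf-datatypes"/>
    </element>
  </choice>
</element>
```

The two libraries are mixed freely within one schema; the only thing
that changes is the datatypeLibrary attribute.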

7. In W3C XML Schema there is no way to specify what is allowed as the
root element.  W3C XML Schema does not define a single notion of
validity of a document with respect to a schema.  There are different
varieties of validation (lax and strict) and many different ways to
validate a document against a schema.  From a W3C XML Schema alone, it
is not possible to know what is a valid document.

For example, consider a totally trivial schema like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
  xmlns="http://www.example.com"
  targetNamespace="http://www.example.com">

<xs:element name="foo">
  <xs:complexType/>
</xs:element>

</xs:schema>

Now consider a totally bogus document like this:

<bar/>

Believe it or not, the W3C XML Schema processors that I have tried
report this as valid!  The definition of validity is so flexible in
W3C XML Schema as to seriously impact interoperability.  If an
application were relying on W3C XML Schema validation to screen
out incorrect input, it would be in serious trouble.

With RELAX NG, this sort of bogosity does not arise: there is a clear,
unambiguous notion of validity.  If you have a RELAX NG schema, there
is no doubt about what instances are valid.
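
For comparison, this is roughly how I would expect the trivial schema
above to be written in RELAX NG.  The "start" element states what is
allowed as the root, so a document whose root is <bar/> is simply
invalid.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.example.com">
  <!-- The start pattern names the only permitted root element. -->
  <start>
    <element name="foo">
      <empty/>
    </element>
  </start>
</grammar>
```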

8. W3C XML Schema provides the xsi:schemaLocation attribute, which
allows an XML document instance to indicate the schema that should be
used to validate the document.  I think this is a serious problem for
a couple of reasons.

One reason is that this is a potential security problem.  One
important use of schemas is to protect an application against invalid
data.  This use of schemas is easily undermined by documents that use
xsi:schemaLocation to point the validator at a schema of their own
choosing.

Another reason is that this leads to interoperability problems.  Its
use is not mandated by the W3C XML Schema Rec: it's just a hint.  Yet,
in some
implementations, this is the only way to specify the schema to use to
validate the document.

In RELAX NG, validation is treated as a process with two independent
inputs, a schema and an instance to be validated with respect to the
schema.

There is no way in a W3C XML Schema to prohibit the instance from
containing xsi:schemaLocation attributes.  Indeed, this is also the
case for other xsi attributes: there is no way to prevent the document
containing xsi:type attributes.  The use of W3C XML Schema infects the
grammar you are defining. If you want a closed grammar that only
allows specific attributes not including the xsi attributes, you
cannot express that in W3C XML Schema.  RELAX NG has no such magic
attributes.
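
To illustrate the xsi:schemaLocation concern, here is a sketch of an
instance that nominates its own schema.  The host name
"attacker.example" and the file "loose.xsd" are invented; the point is
only that the document, not the application, is choosing what it gets
validated against, if the processor honours the hint.

```xml
<!-- The instance supplies a namespace/location pair as a hint. -->
<foo xmlns="http://www.example.com"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.example.com
                         http://attacker.example/loose.xsd"/>
```

Nothing in the schema for http://www.example.com can forbid this
attribute from appearing on foo.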

9. Another problematic area in W3C XML Schema is the support for
infoset augmentation, such as default attributes.  Experience with XML
1.0 has, I believe, shown that this is not a good feature to include
in a schema language.  Apart from being a violation of modularity, it
tends to cause interoperability problems, because it leads to the
possibility of the application getting different information depending
on whether or not validation has been performed.  RELAX NG, by
contrast, never changes the information that an application receives.
It specifies purely what is valid and what is invalid.
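
A small sketch of the default attribute problem, with made-up names
("para", "align"):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<!-- Hypothetical element with a defaulted attribute. -->
<xs:element name="para">
  <xs:complexType>
    <xs:attribute name="align" type="xs:string" default="left"/>
  </xs:complexType>
</xs:element>

</xs:schema>
```

Given the instance <para/>, an application that sits behind a
schema-validating processor would see align="left", whereas one that
reads the same document without validation would see no align
attribute at all.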

I've looked through the archives and I haven't seen any technical
justification for the recommendation of W3C XML Schema as the default
choice of schema language.

Section 1.2 of RFC 2026 lists as two of the goals of the Internet
Standards Process:

- technical excellence
- clear, concise and easily understood documentation

I believe these should be considered in selecting a schema
language. On both of these, I believe RELAX NG is far superior to W3C
XML Schema.  I invite anybody who disagrees to go off and read the two
specifications [3], [5].

I am sorry to have gone on at such length, but I think this is an
important issue.  There seems to be a tendency for people to suspend
their technical judgment when it comes to W3C XML Schema. The attitude
seems to be "It's a W3C Recommendation; everybody is using it, so we
should too, regardless of its technical merits."  I don't think this
attitude serves the best long-term interests of the Internet.  I and
others have sacrificed a huge amount of time and effort to try and
provide the community with a solid, technically credible alternative
and I think it deserves to be considered seriously on its technical
merits and not dismissed on the basis of its current level of market
acceptance.

James

[1] http://www.y12.doe.gov/sgml/sc34/document/0320.htm
[2] http://www.oasis-open.org/committees/relax-ng/tutorial-20011203.html
[3] http://www.w3.org/TR/xmlschema-1
[4] http://www.w3.org/TR/xmlschema-formal/
[5] http://www.oasis-open.org/committees/relax-ng/spec-20011203.html