I'd like to respond to James Clark's posting
about usage of XML Schema in the IETF. First of all, let me summarize my
experience with Schema:
* We've been
discussing the issues of using XML Schema a lot in the WebDAV working groups,
as this is clearly a next step we need to look into. Just as an FYI,
while many of the implementors on that list have brought up the topic of
moving from DTDs to Schema, nobody to date has brought up the idea of Relax NG
or any other XML formalism.
* We've incorporated XML Schema support into the
Oracle database in version 9iR2, and so we've been educating lots of people,
from developers, marketing, education and beta customers, on how to use
XML Schema.
My anecdotal observations so far:
* The W3C Schema Recommendations (part 1 & 2)
are inaccessible to most developers except for a core group of our schema
implementors. You cannot learn how to write schemas from these
specs.
* Most people learn XML Schema from the primer
(part 0), which is very well written and very accessible. There are also
a lot of good books out there (there's an O'Reilly pamphlet that's really
useful). Once they learn Schema from these documents, they operate at a
pretty high level of facility. Schema kind of "feels like" type
definition languages they've seen before, like SQL DDL or Java
classes.
* Most people cannot learn XSLT or other W3C
standards from the W3C specifications, either. However, the problem with
XSLT (for example) is not so much the writing of the spec (which is much less
opaque than the Schema spec) but the mindset required to write XSLT
effectively--the recursive templating nature of the language is pretty
difficult to master. So, the XSLT spec is easier to learn from, but XSLT is harder to
use. Also, because of this paradigm difference, there are major
performance problems with XSLT compared to more traditional paradigms like
JSPs in actual customer applications.
* The number one barrier to acceptance of any
formalism more complete than DTDs is market acceptance--how many people have
bothered to even get a simple understanding of it. Maybe about half
of the people on the DAV WG know enough Schema to be able to review specs
written in the language. I don't think anybody on the WG knows RELAX NG
well enough for us to use it. The earlier versions of the WebDAV
ACL spec I wrote using XML Schema were rewritten into DTDs just to promote
greater awareness.
* Most of our implementors are using tools like
XML Spy to develop schemas.
My thoughts:
* XML Schema was developed as a compromise
between the data-oriented people (like Oracle) and the document-oriented
people. It has some of the problems that occur when things are designed
by committee to meet the needs of multiple constituencies. However,
that's also why it has a lot more market acceptance.
* If I were comparing XML structure definition
languages to programming languages, I would say XML Schema is like C++, and
Relax NG is like Lisp. C++ was another one of those languages designed
by committee, and has some of the same problems around the edges (what happens
again to the object's allocation when I throw an exception from a constructor,
or when I call virtual methods in a constructor?). However, I don't think
that the problems of C++ carry over as much to schema definition, as graphical
tools that work pretty well are available for constructing schemas, whereas
code generation for programs doesn't really work. I expect that XML Schema & Relax NG
will see about the same level of market acceptance as C++ and Lisp did,
respectively, in the 90's, with possibly more dominance for Schema, because of
the abilities of tools to mitigate the complexity.
* I think the number one thing that is important
for the IETF in recommending an XML structure language is market acceptance,
since it is the content of the protocol definition that needs to be
reviewed, and we need to use a language that has as many people as possible
conversant with it.
* XML Schema feels more data-oriented while Relax
NG feels more document-oriented. Most IETF protocols, however, carry
fairly structured, data-oriented content, with variations in the structure
that benefit from XML. I worry that Relax NG validation performance will
compare to Schema validation performance in the same way that XSLT compares to
JSPs.
* While Relax NG went through a standards
process, it didn't have a lot of participation in the process, and I don't
believe it meets the needs of all the potential constituencies as well as
Schema does due to that lack of participation. It's kind of like the
SQL99 standards for multimedia or objects in databases--it's an ISO/ANSI
standard that Oracle moved through the process mostly by itself, but
most vendors don't implement it.
* The #1 problem with using XML in IETF
protocols, in my opinion, is not being able to put binary data in
directly. It would certainly be possible to add something like
chunked-transfer-encoding in XML 2.0 (Core), and I think if the IETF is going
to criticize the work of the W3C, that would be a more useful avenue than
criticizing Schema. I would love to say <content
length="2e45">binary stuff</content> in my protocol messages rather
than forcing a base64 encoding.
* DTDs are definitely not good enough to express
the XML needed in protocol definitions. There are WAY too many "any"
declarations in the DTDs we use in WebDAV, not to mention the need for
primitive datatypes.
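
To put a rough number on the base64 cost mentioned in the binary-data point above, here is a minimal Python sketch. The helper name `base64_overhead` and the sample payload are purely illustrative and not part of any protocol; the only thing it relies on is the standard fact that base64 emits 4 output characters for every 3 input bytes.

```python
import base64


def base64_overhead(payload: bytes) -> float:
    """Return encoded size divided by raw size for a non-empty payload."""
    if not payload:
        raise ValueError("payload must be non-empty")
    return len(base64.b64encode(payload)) / len(payload)


# Hypothetical binary body: every byte value, repeated (3072 bytes total).
raw = bytes(range(256)) * 12
encoded = base64.b64encode(raw)

# 3072 raw bytes become 4096 encoded characters, i.e. roughly one third larger.
print(len(raw), len(encoded), round(base64_overhead(raw), 3))

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == raw
```

The size growth is only part of the cost; the sender and receiver also have to run the encode and decode passes, which a length-prefixed binary section would avoid.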
To James's specific criticisms of XML Schema, I
would say they fall into four camps:
A) XML Schema Spec is hard to read, and is
unintuitive to the uninitiated (points #1 & 2)
B) XML Schema spec is missing some
features that James wants (e.g. co-occurrence constraints for attributes,
notation for unordered content, more flexibility for the <all> group,
constraints for root element) (points #3, 4, 5, 7)
C) XML Schema has poor abstractions
(points 6, 9)
D) unfounded criticism (point 8) -- more on
this later ;-)
Most of James's criticisms are valid in
themselves, but I don't think that they matter that much in the big
picture.
* Point A: the primer is much easier to
read than most specs, and most questions about legality in Schema can be
answered by the primer, at least for most users. There is already
supporting material for Schema (e.g. the O'Reilly booklet) that is also pretty
good, and I think it fixes this problem. So I don't think
this presents a practical problem for most Schema users. If this were a
problem, then Schema wouldn't have the market acceptance that it does.
Unreadability of the Schema spec is only a problem if it limits market
acceptance.
* Point B: any time you freeze a
specification, you do so with some set of features that is less than what some
people would desire. Schema froze its spec much earlier than Relax
NG. Relax NG specifically addressed many of the weaknesses of Schema as
Schema got "close" to a W3C recommendation. Schema 2.0 will address these
issues, and I'm sure it will build on what we learn from Relax NG as well as
customer feedback. Also, there were good reasons for not adding some of these
features. I know that some of the restrictions on the <all> group
were there because of the performance difficulties that streaming processors
would have. My contention would be that Schema is too feature-rich for version
1.0, not too feature-poor, which is what James suggests. While I would
be very happy not to have to implement redefine or key/keyref in Schema, those
features were put in SPECIFICALLY TO MAXIMIZE MARKET ACCEPTANCE.
* Point C: I don't think the abstraction
issues James raises are that significant. First of all (point #6), lots
of very successful type systems (SQL, Java, C/C++) have built-in primitive
types as distinct from constructed types. It allows for more
implementation optimizations.
The reason default attributes were added was
because a lot of people want and need them in typical applications. Now,
I do think that there is an inherent conflict between the way that structured
applications & unstructured applications want to access XML data.
Structured data access (like Java Beans) wants to access a named item known at
compile time, usually without regard to ordering. When I say
"webdav.resource.setModDate()" I want it to work regardless of the ordering
constraints. However, sometimes code written to order-aware APIs like
DOM has to interact with order-unaware code (like JAXB, JSR 31). What we
do in the Oracle implementation is to analyze the co-occurrence constraints at
schema compilation time to figure out whether it is computationally simple to
determine where newly added elements are allowed to go when the ordering is
unspecified, and disallow unordered access (e.g. via JAXB or relational SQL
views) to documents conforming to schemas that are "too complicated".
The reason that there are no default elements (which my customers would like)
is that default element values cannot be specified in general without
information as to the ordering. James's solution to the problem is to
get rid of defaulting (a feature which has been a must in pretty much every
database implementation ever deployed). I would suggest defining a
subset of co-occurrence constraints that allow for unordered access (e.g. if
there is no maxOccurs > 1 on any sequence or choice model used for an
element, it is easy to figure out the order things go in). The market
clearly wants default values.
* Point D: James is complaining about the
limitations of current implementations with respect to their handling of
xsi:schemaLocation. This is clearly not a problem with the Schema
spec. I think the current Schema implementations (given their existence)
are better than most Relax NG implementations (which are much worse, since
they don't exist). However, my experience has been that it is very nice
to use the schemaLocation attribute, because without it, instances don't know
what type they are. If you say that validation is a process requiring both an
instance and a schema, this doesn't interoperate well with most IETF standards
that only refer to an instance (via a URL) and where there is no standard
way to specify the schema separately. Having instances know what type
they are allows for lots of optimizations, such as compilation of instances
that conform to a particular schema definition.
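
The subset suggested under Point C (no maxOccurs > 1 on any sequence or choice model) can be sketched as a simple schema check. This is only an illustration of that one rule under stated assumptions: the function name `allows_unordered_access` and the two sample schemas are hypothetical, and it does not describe the analysis the Oracle implementation actually performs.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def allows_unordered_access(schema_xml: str) -> bool:
    """Return True if no xs:sequence or xs:choice group may repeat."""
    root = ET.fromstring(schema_xml)
    for tag in ("sequence", "choice"):
        for group in root.iter(f"{{{XS}}}{tag}"):
            # maxOccurs defaults to 1 when the attribute is absent.
            max_occurs = group.get("maxOccurs", "1")
            if max_occurs == "unbounded" or int(max_occurs) > 1:
                return False
    return True


# Hypothetical schema: a fixed sequence, so each child has one legal position.
SIMPLE = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="resource">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="displayname" type="xs:string"/>
        <xs:element name="moddate" type="xs:dateTime"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical schema: a repeating choice, so insertion order is ambiguous.
REPEATING = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="resource">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="displayname" type="xs:string"/>
        <xs:element name="moddate" type="xs:dateTime"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

print(allows_unordered_access(SIMPLE))     # True
print(allows_unordered_access(REPEATING))  # False
```

A check like this could run once at schema compilation time, so that order-unaware access is only offered for schemas that pass it.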
My conclusions: (Disclaimer: I don't know Relax
NG very well--just a once-over of the spec, but as James's argument rests
mostly on the faults of XML Schema, I can address those well)
XML Schema is a better language for IETF
standards for the following reasons:
* It has (and will continue to have) greater
market acceptance than alternatives like Relax NG, and getting the maximum
number of people to review the protocol definitions is more important than
dealing with inconsistencies in the schema language abstractions that only come
up in corner cases that nobody needs in IETF protocol standards. Market
acceptance has always been the primary focus of IETF standards work (look at
HTTP for Pete's sake), not purity of abstraction.
* Schema is more data-centric, and is more
natural for protocol data.
* A lot more work has been done on optimization
and performance for Schema than for Relax NG, and I believe that performance of
validation will be a primary concern for IETF protocol implementations.
At Oracle, we've been working on XML Schema compilation for 2 years.
While I don't think we have the implementation experience to demonstrate
it either way, my belief is that performance of Schema validation vs. Relax NG
will track market acceptance.
* I don't think the bugs or missing features in
Schema will affect protocol work in any way. Most of the features in
Schema (inheritance, substitution groups, key/keyref) are unlikely to be used
in IETF recommendations.
* I think we understand the limitations of Schema
better.
I'd hate to see the perfect become the enemy of
the good here.
--Eric Sedlar
P.S. Please CC me directly on any
replies--I'm not on this mailing list yet. Thanks.