I'd like to respond to James Clark's
posting about usage of XML Schema in the IETF. First of all,
let me summarize my experience with Schema:
* We've been discussing the issues of using XML Schema a lot in
the WebDAV working groups, as this is clearly a next step we need to
look into. Just as an FYI, while many of the implementors on
that list have brought up the topic of moving from DTDs to Schema,
nobody to date has brought up the idea of Relax NG or any other XML
formalism.
* We've incorporated XML Schema support
into the Oracle database in version 9iR2, and so we've been
educating lots of people, from developers, marketing, education
and beta customers, on how to use XML Schema.
My anecdotal observations so
far:
* The W3C Schema Recommendations (part
1 & 2) are inaccessible to most developers except for a core
group of our schema implementors. You cannot learn how to
write schemas from these specs.
* Most people learn XML Schema from
the primer (part 0), which is very well written and very
accessible. There are also a lot of good books out there
(there's an O'Reilly pamphlet that's really useful). Once they
learn Schema from these documents, they operate at a pretty high
level of facility. Schema kind of "feels like" type definition
languages they've seen before, like SQL DDL or Java
classes.
* Most people cannot learn XSLT or
other W3C standards from the W3C specifications, either.
However, the problem with XSLT (for example) is not so much the
writing of the spec (which is much less abstruse than the Schema spec)
but the mindset required to write XSLT effectively--the recursive
templating nature of the XSLT language is pretty difficult to
master. So, XSLT is easier to learn, but harder to use.
Also, because of this paradigm difference, there are major
performance problems with XSLT compared to more traditional
paradigms like JSPs in actual customer applications.
* The number one barrier to acceptance
of any formalism more complete than DTDs is market acceptance--how
many people have bothered to even get a simple understanding of
it. Maybe about half of the people on the DAV WG know
enough Schema to be able to review specs written in the
language. I don't think anybody on the WG knows enough RELAX
NG for us to be able to use it. The earlier versions of the
WebDAV ACL spec I wrote using XML Schema were rewritten into DTDs
just to promote greater awareness.
* Most of our implementors are using
tools like XML Spy to develop schemas.
My thoughts:
* XML Schema was developed as a
compromise between the data-oriented people (like Oracle) and the
document-oriented people. It has some of the problems that
occur when things are designed by committee to meet the needs of
multiple constituencies. However, that's also why it has a lot
more market acceptance.
* If I were comparing XML structure
definition languages to programming languages, I would say XML
Schema is like C++, and Relax NG is like Lisp. C++ was another
one of those languages designed by committee, and has some of the
same problems around the edges (what happens again to the object's
allocation when I throw an exception from a constructor, or when I
call virtual methods in a constructor?). However, I don't
think that the problems of C++ carry over as much in schema
definition, as graphical tools are available to construct schemas
that work pretty well, whereas code generation doesn't really
work. I expect that XML Schema & Relax NG will see about
the same level of market acceptance as C++ and Lisp did,
respectively, in the 90's, with possibly more dominance for Schema,
because of the abilities of tools to mitigate the
complexity.
* I think the number one thing that is
important for the IETF in recommending an XML structure language is
market acceptance, since it is the content of the protocol
definition that needs to be reviewed, and we need to use a
language that has as many people as possible conversant with
it.
* XML Schema feels more data-oriented
while Relax NG feels more document-oriented. However, most
IETF protocols carry pretty structured data, but
with variances in the structure that benefit from XML. I worry
that Relax NG validation performance will compare to Schema
validation performance in the same way that XSLT compares to
JSPs.
* While Relax NG went through a
standards process, it didn't have a lot of participation in the
process, and I don't believe it meets the needs of all the potential
constituencies as well as Schema does due to that lack of
participation. It's kind of like the SQL99 standards for
multimedia or objects in databases--it's an ISO/ANSI standard that
Oracle moved through the process mostly by itself, but
most vendors don't implement it.
* The #1 problem with using XML in IETF
protocols, in my opinion, is not being able to put binary data in
directly. It would certainly be possible to add something like
chunked-transfer-encoding in XML 2.0 (Core), and I think if the IETF
is going to criticize the work of the W3C, that would be a more
useful avenue than criticizing Schema. I would love to say
<content length="2e45">binary stuff</content> in my
protocol messages rather than forcing a base64
encoding.
* DTDs are definitely not good enough
to express the XML needed in protocol definitions. There are
WAY too many "any" declarations in the DTDs we use in WebDAV, not to
mention the need for primitive datatypes.
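To put a rough number on the base64 point above: base64 maps every 3 bytes of input to 4 output characters, so an encoded payload is about a third larger than the raw bytes. A minimal Python sketch of that arithmetic follows. The payload is made up and the helper name `base64_length` is hypothetical; neither comes from any WebDAV spec.

```python
import base64

def base64_length(raw_length: int) -> int:
    """Character count of the padded base64 encoding of raw_length bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((raw_length + 2) // 3)

# Made-up binary payload, purely for illustration.
payload = bytes(range(256)) * 12
encoded = base64.b64encode(payload)

print(len(payload), len(encoded), base64_length(len(payload)))
```

The size penalty is fixed by the encoding itself, so it applies to any protocol message that carries binary content inside an XML element this way.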
To James's specific criticisms of XML
Schema, I would say they fall into four camps:
A) XML Schema Spec is hard to
read, and is unintuitive to the uninitiated (points #1 & 2)
B) XML Schema spec is missing
some features that James wants (e.g. co-occurrence constraints
for attributes, notation for unordered content, more flexibility for
the <all> group, constraints for root element) (points #3, 4,
5, 7)
C) XML Schema has poor
abstractions (points 6, 9)
D) unfounded criticism (point 8)
-- more on this later ;-)
Most of James's criticisms are valid in
themselves, but I don't think that they matter that much in the big
picture.
* Point A: the primer is much
easier to read than most specs, and most questions about legality in
Schema can be answered by the primer, at least for most users.
There is already supporting material for Schema (e.g. the O'Reilly
booklet) that is also pretty good, and I think the supporting
material fixes this problem. So I don't think this presents a
practical problem for most Schema users. If this were a
problem, then Schema wouldn't have the market acceptance that it
does. Unreadability of the Schema spec is only a problem if it
limits market acceptance.
* Point B: any time you freeze a
specification, you do so with some set of features that is less than
what some people would desire. Schema froze its spec much
earlier than Relax NG. Relax NG specifically addressed many of
the weaknesses of Schema as it got "close" to a W3C
recommendation. Schema 2.0 will address these issues, and I'm
sure it will build on what we learn from Relax NG as well as customer
feedback. Also, there were good reasons for not adding in some
of these features. I know that some of the restrictions for
the <all> group were there because of performance difficulty
that streaming processors would have. My contention would be
that Schema is too feature-rich for version 1.0, not too
feature-poor, which is what James suggests. While I would be
very happy not to have to implement redefine or key/keyref in
Schema, those features were put in SPECIFICALLY TO MAXIMIZE MARKET
ACCEPTANCE.
* Point C: I don't think the
abstraction issues James raises are that significant. First of
all (point #6), lots of very successful type systems (SQL, Java,
C/C++) have builtin primitive types as distinct from constructed
types. It allows for more implementation
optimizations.
The reason default attributes were
added was because a lot of people want and need them in typical
applications. Now, I do think that there is an inherent
conflict between the way that structured applications &
unstructured applications want to access XML data. Structured
data access (like Java Beans) wants to access a named item known at
compile time, usually without regard to ordering. When I say
"webdav.resource.setModDate()" I want it to work regardless of the
ordering constraints. However, sometimes code written to
order-aware APIs like DOM has to interact with order-unaware code
(like JSR-31-JAXB). What we do in the Oracle implementation is
to analyze the co-occurrence constraints at schema compilation time
to figure out if it is computationally simple to work out where newly
added elements are allowed to go when the ordering is unspecified,
and disallow unordered access (e.g. via JAXB or relational SQL
views) to documents conforming to schemas that are "too
complicated". The reason that there are no default elements
(which my customers would like) is that default element values
cannot be specified in general without information as to the
ordering. James's solution to the problem is to get rid of
defaulting (a feature which has been a must in pretty much every
database implementation ever deployed). I would suggest
defining a subset of co-occurrence constraints that allow for
unordered access (e.g. if there is no maxOccurs > 1 on any
sequence or choice model used for an element, it is easy to figure
out the order things go in). The market clearly wants default
values.
* Point D: James is complaining
about the limitations of current implementations with respect to
their handling of xsi:schemaLocation. This is clearly not a
problem with the Schema spec. I think the current Schema
implementations (given their existence) are better than most Relax
NG implementations (which are much worse, since they don't
exist). However, my experience has been that it is very nice
to use the schemaLocation attribute, because without it, instances don't
know what type they are. If you say that validation is a
process requiring both an instance and a schema, this doesn't
interoperate well with most IETF standards that only refer to an
instance (via a URL) and where there is no standard way to
specify the schema separately. Having instances know what type
they are allows for lots of optimizations, such as compilation of
instances that conform to a particular schema
definition.
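As a rough illustration of the subset suggested under Point C: when a content model is a plain sequence in which nothing has maxOccurs > 1, an order-unaware caller can add a child and the right position falls out of the declared order. The sketch below is a minimal Python version that assumes a flat sequence. The function name and the element names are made up for illustration, and this is not a description of how the Oracle implementation works.

```python
import xml.etree.ElementTree as ET

def insert_in_declared_order(parent, child, declared_order):
    """Insert child under parent at the position implied by declared_order.

    Assumes a flat sequence model in which every element occurs at most
    once and the existing children already follow the declared order.
    """
    rank = {name: i for i, name in enumerate(declared_order)}
    if child.tag not in rank:
        raise ValueError(f"{child.tag} is not allowed here")
    if parent.find(child.tag) is not None:
        raise ValueError(f"{child.tag} is already present (maxOccurs is 1)")
    # The new child goes after every existing sibling declared before it.
    position = sum(1 for sibling in parent if rank[sibling.tag] < rank[child.tag])
    parent.insert(position, child)

# Hypothetical element names, not taken from any WebDAV schema.
order = ["displayname", "moddate", "contenttype"]
resource = ET.fromstring(
    "<resource><displayname>a</displayname>"
    "<contenttype>text/xml</contenttype></resource>"
)
insert_in_declared_order(resource, ET.Element("moddate"), order)
print([c.tag for c in resource])
```

Once repetition is allowed (maxOccurs > 1) or choices are nested, the position is no longer determined by a simple rank lookup, which is the kind of schema the text above calls "too complicated" for unordered access.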
My conclusions: (Disclaimer: I don't
know Relax NG very well--just a once-over of the spec, but as James's
argument rests mostly on the faults of XML Schema, I can address
those well)
XML Schema is a better language for
IETF standards for the following reasons:
* It has (and will continue to have)
greater market acceptance than alternatives like Relax NG, and
getting the maximum number of people to review the protocol
definitions is more important than dealing with inconsistencies in
the schema language abstractions that only come up in corner cases
that nobody needs in IETF protocol standards. Market
acceptance has always been the primary focus of IETF standards work
(look at HTTP for Pete's sake), not purity of
abstraction.
* Schema is more data-centric, and is
more natural for protocol data.
* A lot more work has been done on
optimization and performance of Schema than of Relax NG, and I believe
that performance of validation will be a primary concern for IETF
protocol implementations. At Oracle, we've been working on XML
Schema compilation for 2 years. While I don't think we have
the implementation experience to demonstrate either way, my belief
is that performance of Schema validation vs. Relax NG will track
market acceptance.
* I don't think the bugs or missing
features in Schema will affect protocol work in any way. Most
of the features in Schema (inheritance, substitution groups,
key/keyref) are unlikely to be used in IETF
recommendations.
* I think we understand the limitations
of Schema better.
I'd hate to see the perfect become the
enemy of the good here.
--Eric Sedlar
P.S. Please CC me directly on any
replies--I'm not on this mailing list yet.
Thanks.