I'd like to respond to James Clark's posting about
usage of XML Schema in the IETF. First of all, let me summarize my
experience with Schema:
* We've been
discussing the issues of using XML Schema a lot in the WebDAV working groups, as
this is clearly a next step we need to look into. Just as an FYI, while
many of the implementors on that list have brought up the topic of moving from
DTDs to Schema, nobody to date has brought up the idea of Relax NG or any other
XML formalism.
* We've incorporated XML Schema support into the
Oracle database in version 9iR2, and so we've been educating lots of people,
including developers, marketing, education staff, and beta customers, on how to use XML
Schema.
My anecdotal observations so far:
* The W3C Schema Recommendations (parts 1 & 2)
are inaccessible to most developers except for a core group of our schema
implementors. You cannot learn how to write schemas from these
specs.
* Most people learn XML Schema from the primer
(part 0), which is very well written and very accessible. There are also a
lot of good books out there (there's an O'Reilly pamphlet that's really
useful). Once they learn Schema from these documents, they operate at a
pretty high level of facility. Schema kind of "feels like" type definition
languages they've seen before, like SQL DDL or Java classes.
* Most people cannot learn XSLT or other W3C
standards from the W3C specifications, either. However, the problem with
XSLT (for example) is not so much the writing of the spec (which is much less
abstruse than the Schema spec) but the mindset required to write XSLT
effectively--the recursive templating nature of the language is pretty
difficult to master. So, XSLT is easier to learn, but harder to use.
Also, because of this paradigm difference, there are major performance problems
with XSLT compared to more traditional paradigms like JSPs in actual customer
applications.
* The number one barrier to acceptance of any
formalism more complete than DTDs is market acceptance--how many people have
bothered to even get a simple understanding of it. Maybe about half
of the people on the DAV WG know enough Schema to be able to review specs
written in the language. I don't think anybody on the WG knows Relax NG
well enough for us to use it. The earlier versions of the WebDAV ACL
spec, which I wrote using XML Schema, were rewritten into DTDs just to make
them accessible to more reviewers.
* Most of our implementors are using tools like XML
Spy to develop schemas.
My thoughts:
* XML Schema was developed as a compromise between
the data-oriented people (like Oracle) and the document-oriented people.
It has some of the problems that occur when things are designed by committee to
meet the needs of multiple constituencies. However, that's also why it has
a lot more market acceptance.
* If I were comparing XML structure definition
languages to programming languages, I would say XML Schema is like C++, and
Relax NG is like Lisp. C++ was another one of those languages designed by
committee, and has some of the same problems around the edges (what happens
to the object's allocation when I throw an exception from a constructor,
or when I call virtual methods in a constructor??). However, I don't think that
the problems of C++ carry over as much to schema definition, as graphical tools
that work pretty well are available to construct schemas, whereas code
generation for C++ doesn't really work. I expect that XML Schema & Relax NG
will see about the same level of market acceptance as C++ and Lisp did,
respectively, in the '90s, with possibly more dominance for Schema, because of
the ability of tools to mitigate the complexity.
* I think the number one thing that is important
for the IETF in recommending an XML structure language is market acceptance,
since it is the content of the protocol definition that needs to be
reviewed, and we need to use a language that has as many people as possible
conversant with it.
* XML Schema feels more data-oriented while Relax
NG feels more document-oriented. Most IETF protocols carry
fairly structured, data-oriented content, with variations in the structure
that benefit from XML. I worry that Relax NG validation performance will
compare to Schema validation performance in the same way that XSLT compares to
JSPs.
* While Relax NG went through a standards process,
it didn't have a lot of participation in the process, and I don't believe it
meets the needs of all the potential constituencies as well as Schema does due to
that lack of participation. It's kind of like the SQL99 standards for
multimedia or objects in databases--it's an ISO/ANSI standard that Oracle moved
through the process mostly by itself, but most vendors don't implement
it.
* The #1 problem with using XML in IETF protocols,
in my opinion, is not being able to put binary data in directly. It would
certainly be possible to add something like chunked-transfer-encoding in XML 2.0
(Core), and I think if the IETF is going to criticize the work of the W3C, that
would be a more useful avenue than criticizing Schema. I would love to say
<content length="2e45">binary stuff</content> in my protocol
messages rather than forcing a base64 encoding.
* DTDs are definitely not good enough to express
the XML needed in protocol definitions. There are WAY too many "any"
declarations in the DTDs we use in WebDAV, not to mention the need for primitive
datatypes.
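To put a rough number on the base64 cost in the binary-data point above, here is a minimal Python sketch. The 3000-byte random payload and the decimal length attribute are my assumptions for illustration; only the <content> element name comes from the example above.

```python
import base64
import os
import xml.etree.ElementTree as ET

# Hypothetical payload: 3000 random bytes standing in for "binary stuff".
payload = os.urandom(3000)

# XML 1.0 cannot carry arbitrary bytes, so the payload is base64-encoded
# before it is placed in the element's text.
encoded = base64.b64encode(payload).decode("ascii")
elem = ET.Element("content", {"length": str(len(payload))})
elem.text = encoded
message = ET.tostring(elem)

# Base64 maps every 3 input bytes to 4 output characters.
overhead = len(encoded) / len(payload)

# The receiver must decode the text to recover the original bytes.
decoded = base64.b64decode(ET.fromstring(message).text)
```

With these assumptions the encoded text is about a third larger than the raw payload, before counting the extra encode and decode passes on each end.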
To James's specific criticisms of XML Schema, I
would say they fall into four camps:
A) The XML Schema spec is hard to read, and is
unintuitive to the uninitiated (points #1 & 2)
B) The XML Schema spec is missing some
features that James wants (e.g. co-occurrence constraints for attributes,
notation for unordered content, more flexibility for the <all> group,
constraints for the root element) (points #3, 4, 5, 7)
C) XML Schema has poor abstractions
(points #6, 9)
D) Unfounded criticism (point #8) -- more on
this later ;-)
Most of James's criticisms are valid in themselves,
but I don't think that they matter that much in the big picture.
* Point A: the primer is much easier to read
than most specs, and most questions about legality in Schema can be answered by
the primer, at least for most users. There is supporting material
already for Schema (e.g. the O'Reilly booklet) that is also pretty good, and I think
the supporting material fixes this problem. So I don't think this presents
a practical problem for most Schema users. If this were a problem, then
Schema wouldn't have the market acceptance that it does. Unreadability of
the Schema spec is only a problem if it limits market acceptance.
* Point B: any time you freeze a
specification, you do so with some set of features that is less than what some
people would desire. Schema froze its spec much earlier than Relax
NG. Relax NG specifically addressed many of the weaknesses of Schema as it
got "close" to a W3C recommendation. Schema 2.0 will address these issues,
and I'm sure it will build on what we learn from Relax NG as well as customer
feedback. Also, there were good reasons for not adding in some of these
features. I know that some of the restrictions on the <all> group
were there because of the performance difficulties that streaming processors would
have. My contention would be that Schema is too feature-rich for version
1.0, not too feature-poor, which is what James suggests. While I would be
very happy not to have to implement redefine or key/keyref in Schema, those
features were put in SPECIFICALLY TO MAXIMIZE MARKET ACCEPTANCE.
* Point C: I don't think the abstraction
issues James raises are that significant. First of all (point #6), lots of
very successful type systems (SQL, Java, C/C++) have builtin primitive types as
distinct from constructed types. It allows for more implementation
optimizations.
The reason default attributes were added was
because a lot of people want and need them in typical applications. Now, I
do think that there is an inherent conflict between the way that structured
applications & unstructured applications want to access XML data.
Structured data access (like Java Beans) wants to access a named item known at
compile time, usually without regard to ordering. When I say
"webdav.resource.setModDate()" I want it to work regardless of the ordering
constraints. However, sometimes code written to order-aware APIs like DOM
has to interact with order-unaware code (like JSR-31-JAXB). What we do in
the Oracle implementation is to analyze the co-occurrence constraints at schema
compilation time to determine whether it is computationally simple to figure out where
newly added elements are allowed to go when the ordering is unspecified, and
disallow unordered access (e.g. via JAXB or relational SQL views) to documents
conforming to schemas that are "too complicated". The reason that there
are no default elements (which my customers would like) is that default element
values cannot be specified in general without information as to the
ordering. James's solution to the problem is to get rid of defaulting (a
feature which has been a must in pretty much every database implementation ever
deployed). I would suggest defining a subset of co-occurrence constraints
that allow for unordered access (e.g. if there is no maxOccurs > 1 on any
sequence or choice model used for an element, it is easy to figure out the order
things go in). The market clearly wants default values.
* Point D: James is complaining about the
limitations of current implementations with respect to their handling of
xsi:schemaLocation. This is clearly not a problem with the Schema
spec. I think the current Schema implementations (given their existence)
are better than most Relax NG implementations (which are much worse, since they
don't exist). However, my experience has been that it is very nice to use
the schemaLocation attribute, because without it, instances don't know what type they
are. If you say that validation is a process requiring both an instance
and a schema, this doesn't interoperate well with most IETF standards that only
refer to an instance (via a URL) and where there is no standard way to
specify the schema separately. Having instances know what type they are
allows for lots of optimizations, such as compilation of instances that conform
to a particular schema definition.
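The subset suggested under Point C (no maxOccurs > 1 on any sequence or choice model) can be checked mechanically. Below is a rough Python sketch of one way to do that; it is not the Oracle implementation. The function name, the sample schema, and the reading that only the maxOccurs on the xs:sequence and xs:choice groups themselves is inspected are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def allows_unordered_access(complex_type):
    """Hypothetical check: True if no xs:sequence or xs:choice group
    under complex_type can repeat (maxOccurs is absent, "0", or "1")."""
    for group in complex_type.iter():
        if group.tag in (XS + "sequence", XS + "choice"):
            if group.get("maxOccurs", "1") not in ("0", "1"):
                return False
    return True

# Hypothetical schema with one simple and one repeating content model.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="simple">
    <xs:sequence>
      <xs:element name="displayname" type="xs:string"/>
      <xs:element name="moddate" type="xs:dateTime"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="repeating">
    <xs:choice maxOccurs="unbounded">
      <xs:element name="a" type="xs:string"/>
      <xs:element name="b" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
</xs:schema>
"""

root = ET.fromstring(SCHEMA)
types = {t.get("name"): t for t in root.findall(XS + "complexType")}
```

Here allows_unordered_access(types["simple"]) passes the check, while types["repeating"] does not, because its choice group may repeat without bound.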
My conclusions: (Disclaimer: I don't know Relax NG
very well--just a once-over of the spec, but as James's argument rests mostly on
the faults of XML Schema, I can address those well)
XML Schema is a better language for IETF standards
for the following reasons:
* It has (and will continue to have) greater market
acceptance than alternatives like Relax NG, and getting the maximum number of
people to review the protocol definitions is more important than dealing with
inconsistencies in the schema language abstractions that only come up in corner
cases that nobody needs in IETF protocol standards. Market acceptance has
always been the primary focus of IETF standards work (look at HTTP for Pete's
sake), not purity of abstraction.
* Schema is more data-centric, and is more natural
for protocol data.
* A lot more work has been done on optimization and
performance of Schema than of Relax NG, and I believe that performance of
validation will be a primary concern for IETF protocol implementations. At
Oracle, we've been working on XML Schema compilation for 2 years. While I
don't think we have the implementation experience to demonstrate it either way, my
belief is that performance of Schema validation vs. Relax NG will track market
acceptance.
* I don't think the bugs or missing features in
Schema will affect protocol work in any way. Most of the more complex features in
Schema (inheritance, substitution groups, key/keyref) are unlikely to be used in
IETF recommendations.
* I think we understand the limitations of Schema
better than we understand those of Relax NG.
I'd hate to see the perfect become the enemy of the
good here.
--Eric Sedlar
P.S. Please CC me directly on any
replies--I'm not on this mailing list yet. Thanks.