
Re: RELAX NG and W3C XML Schema



I'd like to respond to James Clark's posting about usage of XML Schema in the IETF.  First of all, let me summarize my experience with Schema:
 
* We've been discussing the issues of using XML Schema a lot in the WebDAV working groups, as this is clearly a next step we need to look into.  Just as an FYI, while many of the implementors on that list have brought up the topic of moving from DTDs to Schema, nobody to date has brought up the idea of Relax NG or any other XML formalism.
 
* We've incorporated XML Schema support into the Oracle database in version 9iR2, and so we've been educating lots of people, from developers, marketing, education and beta customers, on how to use XML Schema.
 
 
My anecdotal observations so far:
 
* The W3C Schema Recommendations (parts 1 & 2) are inaccessible to most developers except for a core group of our schema implementors.  You cannot learn how to write schemas from these specs.
* Most people learn XML Schema from the primer (part 0), which is very well written and very accessible.  There are also a lot of good books out there (there's an O'Reilly pamphlet that's really useful).  Once they learn Schema from these documents, they operate at a pretty high level of facility.  Schema kind of "feels like" type definition languages they've seen before, like SQL DDL or Java classes.
* Most people cannot learn XSLT or other W3C standards from the W3C specifications, either.  However, the problem with XSLT (for example) is not so much the writing of the spec (which is much less abstruse than the Schema spec) but the mindset required to write XSLT effectively--the recursive templating nature of the language is pretty difficult to master.  So, XSLT is easier to learn, but harder to use.  Also, because of this paradigm difference, there are major performance problems with XSLT compared to more traditional paradigms like JSPs in actual customer applications.
* The number one barrier to acceptance of any formalism more complete than DTDs is market acceptance--how many people have bothered to even get a simple understanding of it.  Maybe about half of the people on the DAV WG know enough Schema to be able to review specs written in the language.  I don't think anybody on the WG knows enough RELAX NG to where we could use it.  The earlier versions of the WebDAV ACL spec I wrote using XML Schema were rewritten into DTDs just to promote greater awareness.
* Most of our implementors are using tools like XML Spy to develop schemas.
 
 
My thoughts:
 
* XML Schema was developed as a compromise between the data-oriented people (like Oracle) and the document-oriented people.  It has some of the problems that occur when things are designed by committee to meet the needs of multiple constituencies.  However, that's also why it has a lot more market acceptance.
* If I were comparing XML structure definition languages to programming languages, I would say XML Schema is like C++, and Relax NG is like Lisp.  C++ was another one of those languages designed by committee, and has some of the same problems around the edges (what happens again to the object's allocation when I throw an exception from a constructor, or using virtual methods in a constructor??).  However, I don't think that the problems of C++ carry over as much in schema definition, as graphical tools are available to construct schemas that work pretty well, whereas code generation doesn't really work.  I expect that XML Schema & Relax NG will see about the same level of market acceptance as C++ and Lisp did, respectively, in the 90's, with possibly more dominance for Schema, because of the abilities of tools to mitigate the complexity.
* I think the number one thing that is important for the IETF in recommending an XML structure language is market acceptance, since it is the content of the protocol definition that needs to be reviewed, and we need to use a language that has as many people as possible conversant with it.
* XML Schema feels more data-oriented while Relax NG feels more document oriented.  However, most IETF protocols are basically pretty structured data-oriented, but with variances in the structure that benefit from XML.  I worry that Relax NG validation performance will compare to Schema validation performance in the same way that XSLT compares to JSPs.
* While Relax NG went through a standards process, it didn't have a lot of participation in the process, and I don't believe it meets the needs of all the potential constituencies as well as Schema does due to that lack of participation.  It's kind of like the SQL99 standards for multimedia or objects in databases--it's an ISO/ANSI standard that Oracle moved through the process mostly by itself, but most vendors don't implement it.
* The #1 problem with using XML in IETF protocols, in my opinion, is not being able to put binary data in directly.  It would certainly be possible to add something like chunked-transfer-encoding in XML 2.0 (Core), and I think if the IETF is going to criticize the work of the W3C, that would be a more useful avenue than criticizing Schema.  I would love to say <content length="2e45">binary stuff</content> in my protocol messages rather than forcing a base64 encoding.
* DTDs are definitely not good enough to express the XML needed in protocol definitions.  There are WAY too many "any" declarations in the DTDs we use in WebDAV, not to mention the need for primitive datatypes.
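
On the base64 point above, here is a rough Python sketch of the size cost of carrying binary data inside XML text.  The 3000-byte payload is made up for illustration; the only thing being shown is the fixed 4-for-3 expansion that base64 imposes.

```python
import base64
import os

# Hypothetical binary payload; 3000 random bytes stand in for real content.
payload = os.urandom(3000)

# base64 maps every 3 input bytes to 4 ASCII characters.
encoded = base64.b64encode(payload)

# Ratio of encoded size to raw size (about 4/3 for any sizable payload).
overhead = len(encoded) / len(payload)
print(len(payload), len(encoded), round(overhead, 3))
```

The roughly one-third growth is in addition to the CPU cost of encoding and decoding on each end.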
 
To James's specific criticisms of XML Schema, I would say they fall into four camps:
A) XML Schema Spec is hard to read, and is unintuitive to the uninitiated (points #1 & 2)
B) XML Schema spec is missing some features that James wants (e.g. co-occurrence constraints for attributes, notation for unordered content, more flexibility for the <all> group, constraints for root element) (points #3, 4, 5, 7)
C) XML Schema has poor abstractions  (points 6, 9)
D) unfounded criticism (point 8) -- more on this later ;-)
 
Most of James's criticisms are valid in themselves, but I don't think that they matter that much in the big picture.
 
* Point A:  the primer is much easier to read than most specs, and most questions about legality in Schema can be answered by the primer, at least for most users.  There is already supporting material for Schema (e.g. the O'Reilly booklet) that is also pretty good, and I think the supporting material fixes this problem.  So I don't think this presents a practical problem for most Schema users.  If this were a problem, then Schema wouldn't have the market acceptance that it does.  Unreadability of the Schema spec is only a problem if it limits market acceptance.
 
* Point B:  any time you freeze a specification, you do so with some set of features that is less than what some people would desire.  Schema froze its spec much earlier than Relax NG.  Relax NG specifically addressed many of the weaknesses of Schema as it got "close" to a W3C recommendation.  Schema 2.0 will address these issues, and I'm sure it will build on what we learn from Relax NG as well as customer feedback.  Also, there were good reasons for not adding in some of these features.  I know that some of the restrictions for the <all> group were there because of the performance difficulty that streaming processors would have.  My contention would be that Schema is too feature-rich for version 1.0, not too feature-poor, which is what James suggests.  While I would be very happy not to have to implement redefine or key/keyref in Schema, those features were put in SPECIFICALLY TO MAXIMIZE MARKET ACCEPTANCE.
 
* Point C:  I don't think the abstraction issues James raises are that significant.  First of all (point #6), lots of very successful type systems (SQL, Java, C/C++) have builtin primitive types as distinct from constructed types.  It allows for more implementation optimizations.
 
Default attributes were added because a lot of people want and need them in typical applications.  Now, I do think that there is an inherent conflict between the way that structured applications & unstructured applications want to access XML data.  Structured data access (like Java Beans) wants to access a named item known at compile time, usually without regard to ordering.  When I say "webdav.resource.setModDate()" I want it to work regardless of the ordering constraints.  However, sometimes code written to order-aware APIs like DOM has to interact with order-unaware code (like JSR-31-JAXB).  What we do in the Oracle implementation is to analyze the co-occurrence constraints at schema compilation time to figure out if it is computationally simple to figure out where newly added elements are allowed to go when the ordering is unspecified, and disallow unordered access (e.g. via JAXB or relational SQL views) to documents conforming to schemas that are "too complicated".  The reason that there are no default elements (which my customers would like) is that default element values cannot be specified in general without information as to the ordering.  James's solution to the problem is to get rid of defaulting (a feature which has been a must in pretty much every database implementation ever deployed).  I would suggest defining a subset of co-occurrence constraints that allow for unordered access (e.g. if there are no maxOccurs > 1 on any sequence or choice model used for an element, it is easy to figure out the order things go in).  The market clearly wants default values.
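
The suggested subset test can be sketched in a few lines of Python.  This is only one reading of the rule of thumb above (no maxOccurs greater than 1 on any sequence or choice group), not the Oracle implementation; the function name and both sample schemas are hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

def allows_simple_unordered_access(schema_text):
    """Return True if no xs:sequence or xs:choice group repeats.

    Assumption: "too complicated" is approximated as any sequence or
    choice group whose maxOccurs is greater than 1 or "unbounded".
    """
    root = ET.fromstring(schema_text)
    for tag in ("sequence", "choice"):
        for group in root.iter("{%s}%s" % (XSD_NS, tag)):
            max_occurs = group.get("maxOccurs", "1")  # the XSD default is 1
            if max_occurs == "unbounded" or int(max_occurs) > 1:
                return False
    return True

# Hypothetical schema with a single non-repeating sequence.
SIMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="resource">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="displayname" type="xs:string"/>
        <xs:element name="moddate" type="xs:dateTime"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical schema whose choice group may repeat without bound.
REPEATING = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="resource">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="displayname" type="xs:string"/>
        <xs:element name="moddate" type="xs:dateTime"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

print(allows_simple_unordered_access(SIMPLE))
print(allows_simple_unordered_access(REPEATING))
```

With the first schema, a setter such as setModDate() has exactly one legal position for the new element; with the second, the position is ambiguous, which is the case that would be rejected for unordered access.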
 
* Point D:  James is complaining about the limitations of current implementations with respect to their handling of xsi:schemaLocation.  This is clearly not a problem with the Schema spec.  I think the current Schema implementations (given their existence) are better than most Relax NG implementations (which are much worse, since they don't exist).  However, my experience has been that it is very nice to use the schemaLocation attribute, because without it, instances don't know what type they are.  If you say that validation is a process requiring both an instance and a schema, this doesn't interoperate well with most IETF standards that only refer to an instance (via a URL) and where there is no standard way to specify the schema separately.  Having instances know what type they are allows for lots of optimizations, such as compilation of instances that conform to a particular schema definition.
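
To illustrate what it means for an instance to "know what type it is", here is a minimal Python sketch that reads the xsi:schemaLocation hint from a document.  The example.org namespace, the schema URL, and the helper name are invented for illustration; the attribute itself holds whitespace-separated pairs of namespace URI and schema location.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def schema_locations(instance_text):
    """Map each namespace URI to the schema location hinted by the instance."""
    root = ET.fromstring(instance_text)
    tokens = root.get("{%s}schemaLocation" % XSI_NS, "").split()
    # Tokens alternate: namespace URI, then the location of its schema.
    return dict(zip(tokens[0::2], tokens[1::2]))

# Hypothetical instance document; the URLs are placeholders.
INSTANCE = """<acl xmlns="http://example.org/ns/acl"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://example.org/ns/acl http://example.org/schemas/acl.xsd"/>"""

print(schema_locations(INSTANCE))
```

A receiver that is handed only the instance URL can use this mapping to find (or look up an already compiled copy of) the schema without any out-of-band configuration.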
 
 
My conclusions: (Disclaimer: I don't know Relax NG very well--just a once-over of the spec, but as James's argument rests mostly on the faults of XML Schema, I can address those well)
 
XML Schema is a better language for IETF standards for the following reasons:
 
* It has (and will continue to have) greater market acceptance than alternatives like Relax NG, and getting the maximum number of people to review the protocol definitions is more important than dealing with inconsistencies in the schema language abstractions that only come up in corner cases that nobody needs in IETF protocol standards.  Market acceptance has always been the primary focus of IETF standards work (look at HTTP for Pete's sake), not purity of abstraction.
* Schema is more data-centric, and is more natural for protocol data.
* A lot more work has been done on optimization and performance of schemas than Relax NG, and I believe that performance of validation will be a primary concern for IETF protocol implementations.  At Oracle, we've been working on XML Schema compilation for 2 years.  While I don't think we have the implementation experience to demonstrate either way, my belief is that performance of Schema validation vs. RelaxNG will track market acceptance.
* I don't think the bugs or missing features in Schema will affect protocol work in any way.  Most of the features in Schema (inheritance, substitution groups, key/keyref) are unlikely to be used in IETF recommendations.
* I think we understand the limitations of Schema better.
 
 
I'd hate to see the perfect become the enemy of the good here.
 
--Eric Sedlar
 
P.S.  Please CC me directly on any replies--I'm not on this mailing list yet.  Thanks.