
Re: well-formedness error




On Jun 18, 2004, at 9:45 AM, Walter Underwood wrote:


I think that people have been using "well-formed" to mean "well-formed
in the context of an externally specified encoding". I want us to
use "well-formed" to mean exactly what it means in the XML 1.0
spec. Appendix F talks about external encodings, but it does not
make them part of the well-formed definition. It also says that
RFCs are more authoritative than the XML 1.0 spec on this subject.
To me, that means we need a different term when we talk about XML
with externally specified encoding.

Well, the Webarch document and general good practice agree that if I send you a message with


Content-Type: text/xml; charset=UTF-8

and it contains

<?xml version="1.0" encoding="iso-8859-1" ?>
<pløtz/>

where the ø is the single byte 0xF8 (U+00F8 as encoded in ISO-8859-1), then that's *busted*. I foolishly claimed in a previous message "well, the XML is still well-formed" and it was pointed out to me (politely, off-line) that well, not really, because the XML processor is required to take the header's word that this is UTF-8, and a lone 0xF8 byte is not valid UTF-8.

Having said that, Walter is right and it's useful to distinguish between the conditions where the XML content is intrinsically ill-formed and those where the header is borked. Our specs should make this distinction. It may be the case that Internet Orthodoxy forbids software from reacting differently to the two, although, even though I'm resolutely draconian, recovery from broken headers feels less pernicious to me than recovery from a missing end-tag.
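To make the two failure cases concrete, here is a minimal Python sketch of the example above. The function name `diagnose` and its return strings are made up for illustration and are not from any spec or library. It only checks whether the bytes decode under each label, which is a necessary condition and not a full well-formedness check.

```python
def diagnose(body: bytes, header_charset: str, declared_charset: str) -> str:
    """Report which encoding label, if any, the raw bytes can be decoded with.

    Hypothetical helper: decoding cleanly does not prove the XML is
    well-formed, it only rules out an encoding mismatch.
    """
    def fits(charset: str) -> bool:
        try:
            body.decode(charset)
            return True
        except UnicodeDecodeError:
            return False

    if fits(header_charset):
        return "bytes match the header charset"
    # Note: iso-8859-1 maps every byte value, so it never fails to decode.
    if fits(declared_charset):
        return "header is wrong; bytes match the XML declaration"
    return "bytes match neither label"


# The message from the example: the header says UTF-8, the XML declaration
# says iso-8859-1, and the ø is the single byte 0xF8.
body = b'<?xml version="1.0" encoding="iso-8859-1" ?>\n<pl\xf8tz/>'
result = diagnose(body, "utf-8", "iso-8859-1")
```

Here `result` lands in the "header is wrong" case: the content would be fine on its own, but a processor that takes the header's word has to reject it.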

You know, we could specify that Atom MUST always be encoded in UTF-8 and/or that the root element must be <Atøm>. Then, we'd have belt-and-suspenders safety in the face of the most deranged encoding breakage. No, that's probably not a serious suggestion.

-Tim