
Re: well-formedness error




Mark Pilgrim wrote:
Well-formed: http://feedparser.org/tests/wellformed/encoding/http_i18n.xml
Not well-formed: http://feedparser.org/tests/illformed/encoding/http_i18n.xml

The only difference is the Content-type header.  The first feed uses
"application/xml", but the second feed uses "text/xml".

Universal Feed Parser correctly reports the first feed as well-formed
(bozo=0) and the second feed as non-well-formed (bozo=1,
bozo_exception=CharacterEncodingOverride).

If your parser thinks the second feed is well-formed, your parser is
broken and should be fixed.

OK, the parsers I tried on my system (including MSXML3, MSXML4, libxml and Instant-Saxon's builtin parser) all accept this. I do agree that this is a problem.
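
For reference, here is a minimal Python sketch of the RFC 3023 rule that makes the two feeds differ: with "text/xml" and no charset parameter the encoding is us-ascii regardless of what the XML declaration says, while with "application/xml" the document's own BOM or encoding declaration decides. The function name rfc3023_charset is made up for illustration, and treating text/*+xml like text/xml is my reading of the RFC, not something established in this thread.

```python
from email.message import Message


def rfc3023_charset(content_type):
    """Return the charset the transport dictates, or None if the document decides.

    Hypothetical helper: a sketch of the RFC 3023 defaulting rule only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()    # lowercased "type/subtype"
    charset = msg.get_content_charset()    # lowercased charset parameter, or None

    if charset:
        # An explicit charset parameter is authoritative for both families.
        return charset
    if media_type == "text/xml" or (
        media_type.startswith("text/") and media_type.endswith("+xml")
    ):
        # Assumption: text/*+xml follows the text/xml rule.
        return "us-ascii"
    # application/xml and friends: fall back to BOM / XML declaration.
    return None


print(rfc3023_charset("application/xml"))  # None
print(rfc3023_charset("text/xml"))         # us-ascii
```

So the same bytes, declared as iso-8859-1 in the XML declaration, are fine under the first header and must be read as us-ascii under the second.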


The issue here is that the situation is extremely unlikely ever to improve, because

- there is so much XML content on the web that does not comply with RFC 3023, and

- for most parser APIs, it's hard or even impossible to do that check reliably: you would need either the ability to pass in the expected encoding (and let the parser catch fire when it discovers this isn't the actual one), or a way to ask the parser afterwards which encoding it actually used.
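
When the parser API offers neither hook, the check can be approximated outside the parser. The following Python sketch assumes the expected encoding has already been derived from the Content-Type header and is passed in as `expected`; the names `declared_encoding` and `check_transport_encoding` are hypothetical. It only sniffs XML declarations in ASCII-compatible encodings (so not UTF-16), and it is not how any of the parsers mentioned above work.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document. Works only for ASCII-compatible encodings.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(body):
    """Return the lowercased encoding named in the XML declaration, or None."""
    m = _DECL.match(body)
    return m.group(1).decode("ascii").lower() if m else None


def _same_codec(a, b):
    """True if both names resolve to the same Python codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False


def check_transport_encoding(body, expected):
    """List the problems that arise if `expected` must override the document."""
    problems = []
    declared = declared_encoding(body)
    if declared and not _same_codec(declared, expected):
        problems.append(
            "document declares %s but transport says %s" % (declared, expected)
        )
    try:
        # Strict decoding: any byte outside the expected encoding is an error.
        body.decode(expected)
    except UnicodeDecodeError as exc:
        problems.append("not decodable as %s: %s" % (expected, exc.reason))
    return problems


feed = b'<?xml version="1.0" encoding="iso-8859-1"?><feed>caf\xe9</feed>'
print(check_transport_encoding(feed, "iso-8859-1"))  # []
print(len(check_transport_encoding(feed, "us-ascii")))  # 2
```

Only if the list comes back empty would the bytes be handed to the real XML parser; anything else is reported before parsing starts.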

For Atom, this "problem" will just cease to exist if we define the correct content type(s) ("application/...") and make the behaviour for other content types undefined.

At the end of the day, the text/* default encoding vs XML thing is a complete mess. However, I see people using this as proof that draconian XML error checking is the wrong thing to do, and I don't think that line of argument is acceptable. In the case of an RFC 3023-based problem, people often *can't* do the proper checking (*). On the other hand, if there's an XML well-formedness problem with the actual payload, users of regular XML parsers simply have no way to parse the content short of not using an XML parser at all.

So, if the Atom community can't live with the requirement of well-formed XML *payloads* (leaving MIME type information aside), by all means do *not* use XML at all.

Best regards, Julian


(*) It would be nice to have example code for popular XML parser APIs (Java, MSXML...) that demonstrates how to add a robust RFC3023-compliance check.



-- <green/>bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760