libraries and external charset detection (was: Re: well-formedness error)
At 19:29 04/06/16 -0400, Sam Ruby wrote:
As far as I know, NONE of the popular libraries, on all the popular
platforms, and in all the popular languages, take RFC 3023 into
consideration. Nor do they provide ANY mechanism for the caller to
indicate the "Presence of External Encoding Information" [1].
The way the feed validator accomplishes this function is to actually open
the stream, peek at the first few bytes, determine the declared encoding,
and actually REPLACE the prolog if necessary to get these to match.
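As a rough illustration of that peek-and-replace idea (not the feed validator's actual code), here is a minimal Java sketch. The class and method names are hypothetical. It assumes an ASCII-compatible encoding, no byte order mark, and a declaration that already carries an encoding pseudo-attribute; a document without one is returned unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites the encoding pseudo-attribute of the XML
// declaration so that it agrees with externally supplied charset information.
final class PrologRewriter {
    // XML declaration at the very start of the document, up to and including
    // the quoted encoding value. [^>]* cannot run past the closing "?>".
    private static final Pattern DECL = Pattern.compile(
        "^(<\\?xml[^>]*?encoding\\s*=\\s*[\"'])([A-Za-z][A-Za-z0-9._-]*)([\"'])");

    // Peek at the first bytes. ISO-8859-1 maps every byte to exactly one char,
    // so character offsets in this string equal byte offsets in the document.
    private static String head(byte[] doc) {
        return new String(doc, 0, Math.min(doc.length, 200), StandardCharsets.ISO_8859_1);
    }

    // Returns the declared encoding, or null if none is declared.
    static String declaredEncoding(byte[] doc) {
        Matcher m = DECL.matcher(head(doc));
        return m.find() ? m.group(2) : null;
    }

    // Returns the document with the declared encoding replaced by the external
    // one; the original array is returned if nothing needs to change.
    static byte[] forceEncoding(byte[] doc, String external) {
        Matcher m = DECL.matcher(head(doc));
        if (!m.find() || m.group(2).equalsIgnoreCase(external)) {
            return doc;
        }
        byte[] newDecl = (m.group(1) + external + m.group(3))
            .getBytes(StandardCharsets.ISO_8859_1);
        int oldLen = m.end(); // bytes consumed by the old declaration prefix
        byte[] out = new byte[newDecl.length + doc.length - oldLen];
        System.arraycopy(newDecl, 0, out, 0, newDecl.length);
        System.arraycopy(doc, oldLen, out, newDecl.length, doc.length - oldLen);
        return out;
    }
}
```

The rewritten bytes can then be handed to any parser that treats the prolog as authoritative.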
The way the RDF validator at http://www.w3.org/RDF/Validator/ does
it is to look at the headers, and maybe the first few bytes of the actual
document, to determine the encoding. It then decodes the byte stream
and hands the document over to the parser as a stream of characters.
The parser then doesn't look at the encoding pseudo-attribute anymore,
because it already gets characters. This depends on using a language
such as Java that makes a clear distinction between bytes and characters,
and on using a parser that takes the document as a stream of characters.
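A minimal sketch of this approach, using only the standard JAXP and SAX classes. This is not the ARPServlet code; the class and method names are hypothetical, the Content-Type parsing is deliberately simplistic, and the UTF-8 fallback stands in for the byte-sniffing step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: decode with the externally supplied charset first,
// then give the parser characters instead of bytes.
final class ExternalCharsetParse {
    // Extracts the charset parameter from a Content-Type header value.
    // Charset.forName throws if the name is illegal or unsupported.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    return Charset.forName(p.substring(8).trim().replace("\"", ""));
                }
            }
        }
        return fallback; // assumption: a real tool would sniff the bytes here
    }

    static Document parse(byte[] body, String contentType) {
        Charset cs = charsetFromContentType(contentType, StandardCharsets.UTF_8);
        // Bytes become characters here, before the parser sees anything.
        Reader chars = new InputStreamReader(new ByteArrayInputStream(body), cs);
        try {
            // An InputSource built from a Reader is read as characters; the
            // encoding declaration in the prolog is disregarded.
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(chars));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, a body sent as `text/xml; charset=iso-8859-1` whose prolog says `encoding="utf-8"` is decoded as ISO-8859-1, and the mismatching declaration has no effect.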
Regards, Martin.
P.S.: The code is at
http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java
There is some similar code at
http://validator.w3.org/source/. You are free to reuse/copy/adapt it.
Perhaps somebody out there will know of one or more libraries that
actually do provide such support or interface; and if so, I would be
interested in hearing about it. But even if such occurs, this would not
change the reality that the overwhelming majority of applications and
libraries consider the XML prolog to be authoritative.
- Sam Ruby
[1]<http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-with-ext-info>