
libraries and external charset detection (was: Re: well-formedness error)




At 19:29 04/06/16 -0400, Sam Ruby wrote:


As far as I know, NONE of the popular libraries, on all the popular platforms, and in all the popular languages, take RFC 3023 into consideration. Nor do they provide ANY mechanism for the caller to indicate the "Presence of External Encoding Information" [1].

The way the feed validator accomplishes this function is to actually open the stream, peek at the first few bytes, determine the declared encoding, and actually REPLACE the prolog if necessary to get these to match.
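The peek-and-replace approach described above could look roughly like the following. This is only a minimal sketch, not the feed validator's actual code: the names `declared_encoding` and `force_encoding` are hypothetical, the externally supplied charset is assumed to be known already, and the document is assumed to be in an ASCII-compatible encoding (so no UTF-16 or EBCDIC handling).

```python
import re

# XML declaration at the very start of an ASCII-compatible byte stream.
_XML_DECL = re.compile(rb'^<\?xml\s[^>]*?\?>')
_ENCODING = re.compile(rb'''encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')
_VERSION = re.compile(rb'''version\s*=\s*["']([0-9.]+)["']''')


def declared_encoding(data):
    """Peek at the first bytes and return the declared encoding, or None."""
    decl = _XML_DECL.match(data[:200])
    if decl is None:
        return None
    found = _ENCODING.search(decl.group(0))
    return found.group(1).decode("ascii") if found else None


def force_encoding(data, external):
    """Rewrite the XML declaration so that it names the external encoding.

    The bytes of the document body are left untouched; only the
    declaration is replaced, and only if it disagrees with `external`.
    Note that a standalone pseudo-attribute, if present, is dropped.
    """
    declared = declared_encoding(data)
    if declared is not None and declared.lower() == external.lower():
        return data  # already consistent, nothing to replace

    version = b"1.0"
    rest = data
    decl = _XML_DECL.match(data[:200])
    if decl is not None:
        found = _VERSION.search(decl.group(0))
        if found:
            version = found.group(1)
        rest = data[decl.end():]  # strip the old declaration

    new_decl = (b'<?xml version="' + version + b'" encoding="'
                + external.encode("ascii") + b'"?>')
    return new_decl + rest
```

The rewritten bytes can then be handed to a parser that only trusts the declaration, since the declaration now agrees with the external information.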

The way the RDF validator at http://www.w3.org/RDF/Validator/ does it is to look at the headers, and maybe the first few bytes of the actual document, to determine the encoding, then decode the byte stream and hand the document over to the parser as a stream of characters. The parser then doesn't look at the encoding pseudo-attribute anymore, because it already gets characters. This depends on using a language such as Java that makes a clear distinction between bytes and characters, and on using a parser that takes the document as a stream of characters.
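The same decode-first idea can be sketched in Python 3, which also separates bytes from characters. This is an illustration under stated assumptions, not the ARPServlet code: the helper names are hypothetical, the fallback order (header charset, then byte order mark, then declaration, then UTF-8) is a simplification rather than the full RFC 3023 rules, and it relies on the standard library's ElementTree treating a `str` argument as already decoded.

```python
import codecs
import re
import xml.etree.ElementTree as ET
from email.message import Message

_DECLARED = re.compile(
    rb'''^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def sniff_charset(data):
    """Guess the encoding from the first few bytes (simplified)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    found = _DECLARED.match(data[:200])
    return found.group(1).decode("ascii") if found else "utf-8"


def parse_with_external_charset(data, content_type=None):
    """Decode the bytes first, then give the parser characters only."""
    charset = header_charset(content_type) or sniff_charset(data)
    text = data.decode(charset).lstrip("\ufeff")
    # The parser receives a str, so the encoding pseudo-attribute in the
    # declaration no longer decides how the content is interpreted.
    return ET.fromstring(text)
```

For example, a document that declares iso-8859-1 but is served with `charset=utf-8` is decoded as UTF-8 here, whatever its declaration says.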

Regards, Martin.

P.S.: The code is at http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java
There is some similar code at
http://validator.w3.org/source/. You are free to reuse/copy/adapt it.



Perhaps somebody out there will know of one or more libraries that actually do provide such support or interface; and if so, I would be interested in hearing about it. But even if such occurs, this would not change the reality that the overwhelming majority of applications and libraries consider the XML prolog to be authoritative.

- Sam Ruby

[1]<http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-with-ext-info>