
Re: Some text that may be useful for the update of RFC 2376



In message "Re: Some text that may be useful for the update of RFC 2376",
Rick Jelliffe wrote...
 >What about this?
 >
 >	1) In all cases, charset parameter is required.
 >	There is no default. Failure is an unrecoverable
 >	error, for general applications. Detection is
 >	mandatory.

This is a change I can agree on.

 >	2) In all cases, all code sequences in
 >	the document must match code sequences allowed
 >	by the encoding specified by the charset parameter.
 >	Failure is an unrecoverable error, for general
 >	applications. Detection is not mandatory.

Agreed.  I think that this is an issue for XML 1.0 rather than 
for RFC 2376.

 >	3) In all cases, if the document starts with a BOM,
 >	the charset parameter must indicate which flavour
 >	of UTF-16 is being used. There is no default.
 >	Failure is an unrecoverable error, for general
 >	applications. Detection is not mandatory, but should
 >	be made so at some future date.

UTF-16LE and UTF-16BE cannot be used for XML.  XML mandates 
the BOM for UTF-16, whereas UTF-16LE and UTF-16BE must not 
have a BOM.  For more on this, see RFC 2781.
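
As a rough illustration of the BOM point, here is a minimal Python 
sketch.  The function name utf16_bom is made up for this example, and 
the check is simplified: it does not tell a UTF-16 little-endian BOM 
apart from the longer UCS-4 one.

```python
import codecs

def utf16_bom(data: bytes):
    """Return 'big', 'little', or None according to a leading UTF-16 BOM.

    Hypothetical helper for illustration; it looks only at the first two bytes.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

doc = '<?xml version="1.0" encoding="utf-16"?><a/>'
with_bom = doc.encode("utf-16")         # Python's "utf-16" codec writes a BOM
without_bom = doc.encode("utf-16-le")   # "utf-16-le" writes no BOM

print(utf16_bom(with_bom))
print(utf16_bom(without_bom))
```

Only the first byte stream carries the BOM that XML expects for UTF-16.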

 >	4) If the document is sent text/xml, the encoding
 >	parameter of the XML header is not checked. However,
 >	well-behaved systems should rewrite the encoding
 >	attribute of the XML header to agree with charset 
 >	parameter. 

When the recipient has to discard the MIME header, it has 
to rewrite the encoding declaration.  I believe that RFC 2376 
already covers this.
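
A minimal Python sketch of what such a recipient might do when it saves 
the body without the MIME header.  The function name and the choice of 
UTF-8 as the target are assumptions made for this example; the regular 
expression handles only an XML declaration at the very start of the 
document, not text declarations in external entities.

```python
import re

# Matches '<?xml version="..."' plus an optional encoding pseudo-attribute.
_XML_DECL = re.compile(
    r'^(<\?xml\s+version\s*=\s*(["\'])[^"\']*\2)'
    r'(\s+encoding\s*=\s*(["\'])[A-Za-z][A-Za-z0-9._-]*\4)?'
)

def save_without_mime_header(body: bytes, charset: str,
                             target: str = "utf-8") -> bytes:
    """Hypothetical helper: trust the charset parameter, then rewrite the
    encoding declaration so the saved bytes describe themselves."""
    text = body.decode(charset)
    if text.startswith("\ufeff"):        # drop a BOM that survived decoding
        text = text[1:]
    m = _XML_DECL.match(text)
    if m:
        text = f'{m.group(1)} encoding="{target}"' + text[m.end():]
    else:
        text = f'<?xml version="1.0" encoding="{target}"?>\n' + text
    return text.encode(target)

# A transcoder changed the bytes to Shift_JIS but left the declaration alone.
body = '<?xml version="1.0" encoding="euc-jp"?><a>\u65e5</a>'.encode("shift_jis")
saved = save_without_mime_header(body, "shift_jis")
```

Once the header is gone, the stale euc-jp declaration would mislead the 
next reader, which is why the rewrite is needed.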

 >	5) If the data is sent application/xml then
 >	the charset parameter must agree with the
 >	encoding attribute of the XML header. Failure is
 >	an unrecoverable error, for general applications.
 >	Detection is not mandatory.

In other words, you are proposing that XML-unaware transcoders 
should not be used for application/xml.  Since I would like to encourage 
efficient and generic transcoders, I am reluctant.

 >	6) The rules above can be bent or strengthened for
 >	specialist applications, by specific agreement between
 >	the recipient and sending parties. The main 
 >	alteration envisaged would be to allow, as an 
 >	obvious error-recovery strategy, that if the 
 >	charset parameter is missing, the encoding attribute
 >	of the XML header can be used. Another alteration
 >	envisaged is for some defaulting to be used.
 >	However, specialist applications which require this
 >	behaviour should not, in general be using text/xml*
 >	or application/xml*.

Some restrictions are useful for some XML-based media types.  For 
example, application/iotp-xml might allow Unicode only.   I am 
willing to mention such restrictions in the I-D. 

 >Discussion:
 >
 >The reason for 1) is that we have a clash between user expectations
 >(iso8859-1), RFCs (US-ASCII) and XML defaults (UTF-8). There is
 >no winnable solution to defaults. 

I am personally happy to mandate the charset parameter.  

When RFC 2376 was sent to the IAB, the default for text/xml in the 
case of HTTP was 8859-1.  The IAB suggested US-ASCII.

 >The reason for 2) is simply to state clearly that error-recovery
 >from corrupted data is not the norm.
 >
 >The reason for 3) is that, as Murata-san's proposed
 >Japanese Profile of XML makes clear, there are Japanese flavours
 >of Unicode floating about.

As Martin corrected, conversion tables are ambiguous.  But there 
are no flavors of Unicode.

 >The reason for 5) is that the reason why we have application/xml
 >as well as text/xml is to prevent point-to-point manipulation of
 >the data. It should be treated like a binary file. It should 
 >allow end-to-end data integrity. 

I do not understand why we have to prohibit transcoding that 
does not rewrite encoding declarations.  The main argument against 
the charset parameter is that it is often missing or incorrect.  
Application/xml allows the omission of the charset parameter.  
If it is omitted, we rely on the autodetection described in XML 1.0.  
I believe that it was Martin who proposed this compromise in the 
W3C XML SIG and everybody can live with it.
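
The compromise for application/xml can be sketched roughly as follows in 
Python.  This is a simplified illustration, not the full procedure of 
XML 1.0: the function name is made up, and the UCS-4, EBCDIC, and 
BOM-less UTF-16 cases are left out.

```python
import re

def effective_encoding(body: bytes, charset=None) -> str:
    """Hypothetical helper: the charset parameter wins when present;
    otherwise fall back to a reduced form of XML 1.0 autodetection."""
    if charset:
        return charset
    if body.startswith(b"\xef\xbb\xbf"):               # UTF-8 BOM
        return "utf-8"
    if body.startswith((b"\xfe\xff", b"\xff\xfe")):    # UTF-16 BOM
        return "utf-16"
    # ASCII-compatible start: read the encoding declaration, if any.
    m = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        body,
    )
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"                                      # XML's own default

declared = b'<?xml version="1.0" encoding="Shift_JIS"?><a/>'
print(effective_encoding(declared))
print(effective_encoding(declared, charset="utf-8"))
```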

I see no reason to preserve byte sequences.  We only have to 
preserve XML information sets.
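
To make the distinction concrete, here is a small Python sketch using the 
standard xml.etree.ElementTree module.  The helper name shape and the 
sample document are invented for this example, and the comparison covers 
only elements, attributes, and text, which is a rough stand-in for the 
information set.

```python
import xml.etree.ElementTree as ET

def shape(elem):
    """Hypothetical helper: reduce an element tree to plain Python data."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        elem.text or "",
        elem.tail or "",
        [shape(child) for child in elem],
    )

body = '<menu><item price="3">caf\u00e9</item></menu>'
latin1 = ('<?xml version="1.0" encoding="iso-8859-1"?>' + body).encode("iso-8859-1")
utf8 = ('<?xml version="1.0" encoding="utf-8"?>' + body).encode("utf-8")

# The byte sequences differ, but the parsed content is the same.
print(latin1 == utf8)
print(shape(ET.fromstring(latin1)) == shape(ET.fromstring(utf8)))
```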

 >(There is a fundamental weak point in point-to-point charset 
 >parameter transmission: there is no standard mechanism for 
 >registering the character set of individual files which a 
 >webserver can pick up: furthermore, some programming languages 

Apache's AddType and AddCharset directives allow registration for 
each directory.  We can also use conventions for file extensions.

It would be great if the W3C team further enhances Apache.

 >such as C do not have a character type but operate on storage types, 
 >so the encoding data is not available automatically anyway; 

Existing programming languages do not support Unicode very well, as 
I see it.

 >also, on UNIX systems using pipes, there is no parallel channel 
 >available for out-of-band information between the processes on 
 >either side of the pipe, so encoding information may be
 >difficult to propagate automatically. 

This is true, but programs interchange DOM data rather than textual 
XML.



----
MURATA Makoto  muraw3c@xxxxxxxxxxxxx