Re: XML Guidelines -04
Hello Tim,
Many thanks for the quick reply. Sorry for the delay.
At 21:24 02/06/04 -0700, Tim Bray wrote:
Martin Duerst wrote:
Hmm... Martin's got some good points. This isn't a drop dead issue either
way I think.
I still have to look at the above mail in detail, and I can obviously
not speak for the I18N WG here, but I would just like to note that the
Character Model, and that recommendation in particular, is not only
addressed at XML, but any other (potential) formats, too.
Oops, yes, the TAG comment should have been clear that it was talking
about the XML case.
Thanks for the clarification.
Checking that the first two bytes in the input stream are not
FFFE or FEFF to reject UTF-16 can obviously be done very efficiently,
in various places. Checking that the input is in UTF-8 is a bit more
difficult, but a very simple finite state machine does the job.
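As a rough illustration of the two checks described above, here is a minimal sketch in Java. The class and method names (Utf8Check, startsWithUtf16Bom, isValidUtf8) are hypothetical and not taken from any parser. The state machine assumes the present-day definition of UTF-8 (RFC 3629: at most four bytes per character, no overlong forms, no encoded surrogates), which is stricter than the definition in force when this thread was written.

```java
/** Hypothetical helper for illustration only; not from any existing library. */
class Utf8Check {
    /** True if the input starts with a UTF-16 byte order mark (FE FF or FF FE). */
    static boolean startsWithUtf16Bom(byte[] in) {
        if (in.length < 2) return false;
        int b0 = in[0] & 0xFF, b1 = in[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }

    /** Small state machine: true if the whole input is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] in) {
        int remaining = 0;         // continuation bytes still expected
        int lo = 0x80, hi = 0xBF;  // allowed range for the next continuation byte
        for (byte raw : in) {
            int b = raw & 0xFF;
            if (remaining > 0) {
                if (b < lo || b > hi) return false;
                lo = 0x80; hi = 0xBF;  // only the first continuation byte is restricted
                remaining--;
            } else if (b <= 0x7F) {
                // ASCII byte: nothing to track
            } else if (b >= 0xC2 && b <= 0xDF) {
                remaining = 1;         // C0 and C1 would be overlong forms
            } else if (b >= 0xE0 && b <= 0xEF) {
                remaining = 2;
                if (b == 0xE0) lo = 0xA0;       // reject overlong three-byte forms
                else if (b == 0xED) hi = 0x9F;  // reject encoded surrogates
            } else if (b >= 0xF0 && b <= 0xF4) {
                remaining = 3;
                if (b == 0xF0) lo = 0x90;       // reject overlong four-byte forms
                else if (b == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
            } else {
                return false;          // 80..C1 and F5..FF cannot start a character
            }
        }
        return remaining == 0;         // input must not end mid-character
    }
}
```

Note that input beginning with the UTF-8 signature EF BB BF passes isValidUtf8, since those three bytes are themselves well-formed UTF-8; only the two UTF-16 byte order marks are rejected by the first check.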
Another alternative is to use an open-source parser and just very
slightly hack it so that you can ask it for the encoding that came in.
Actually, since encoding is in the infoset, a parser that doesn't tell you
is arguably nonconformant.
Very good point.
- The IETF has a clear preference for UTF-8 over UTF-16. UTF-8 is
core to RFC 2277, and is a draft standard (and on its way to
an IETF standard). UTF-16 is only an informational RFC.
As a (mostly) C programmer, I also have a clear preference for UTF-8, and
for a variety of reasons I agree with the IETF. However, the Java
programmers have some pain here (yes I know that a java char isn't really
a UTF-16 char, but most programmers can pretend it is without causing
breakage).
UTF-16 is *not* going away.
I would never claim such a thing. Indeed, for *internal* representation,
UTF-16 is doing very well. Java and Windows are the two main examples.
But this draft is not about internal representation.
- There are some places where XML could be used where ASCII-compatibility
is crucial. Imagine using a small piece of XML in an http-like header.
Well, if it's got non-ASCII chars you're toast anyhow :)
Not necessarily. Netnews and http allow 8-bit headers, only smtp doesn't.
"XML could be used" and "ASCII-compatibility is crucial" feel to me like
objectives that are strongly in conflict.
I don't really understand that. Maybe you interpret 'ASCII-compatibility'
as 'ASCII and only ASCII'. This is not how I meant it. I meant it
in the sense it is provided by UTF-8. Sorry about the confusion.
- During the creation of XML, originally only UTF-8 was required.
But then there was very strong pressure to also include UTF-16.
As far as I am aware, there is now considerably less
such pressure, if there is indeed still any.
I completely disagree both with the history and the assertion about
current trends, but I'm not sure this is relevant.
Let's leave history for another time. I would very much be interested
to hear about pressures for using UTF-16 *on the wire*.
Well, 'being liberal in what you accept' could be interpreted much
more liberally, e.g. accept all kinds of encodings. And given that
for most parsers, it's as difficult (or easy) to instruct them to
take only UTF-8 as it is to instruct them to take exactly UTF-8
and UTF-16, that may be where your argument is heading. But it's
very clear that this doesn't contribute to interoperability, which
is the final goal. If everybody sends UTF-8, that goal is met.
'be liberal in what you accept' is not really XML's motto either,
for very good reasons.
Actually, XML is *very* liberal in the particular case of character
encodings. This seems to be a popular choice. I do *not* believe that,
in the context of XML, debarring UTF-16 has any significant effect on
interoperability.
Does that mean that you say 'there is no significant difference,
in terms of interoperability, between allowing only UTF-8 and
allowing both UTF-8 and UTF-16'?
While you have mostly looked at the receiving end, do you think
there is any major reason that 'only UTF-8' would put any significant
burdens on the sending side?
Yes, for Java programmers. I know the UTF-8 handling is much better than
it used to be, but UTF-8 was still a 2nd-class Java citizen last time I
looked. I'd be glad to hear I'm wrong; I've been working in C the last
couple of years.
Internally, Java uses UTF-16. In that sense, UTF-8 is
a second-class citizen for Java, like iso-8859-1 and all the
others. Also, you may be referring to the fact that some transcoders
basically assumed UCS-2 rather than UTF-16. That fortunately
seems to be a thing of the past. I just downloaded
JDK 1.2.2 for Windows, wrote a very small example, and here
is the source:
import java.io.*;

/**
 * The utf16 class implements an application that writes some text,
 * held internally as UTF-16, to the file test.txt as UTF-8.
 */
class utf16 {
    public static void main(String[] args) {
        try {
            FileOutputStream fos = new FileOutputStream("test.txt");
            Writer out = new OutputStreamWriter(fos, "UTF8");
            // One string, split into pieces only to keep the lines short.
            out.write("Hello World! "
                + "\ud800\udf07\ud800\udf04\ud800\udf0b\ud800\udf0b\ud800\udf0f "
                + "\ud800\udf1e\ud800\udf0f\ud800\udf13\ud800\udf0b\ud800\udf03!");
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The \uhhhh escapes are UTF-16 surrogate pairs that spell 'Hello World' in Old Italic.
And here is the output, analyzed using cygwin:
$ od -hc test.txt
0000000 6548 6c6c 206f 6f57 6c72 2164 f020 8c90
H e l l o W o r l d ! 360 220 214
0000020 f087 8c90 f084 8c90 f08b 8c90 f08b 8c90
207 360 220 214 204 360 220 214 213 360 220 214 213 360 220 214
0000040 208f 90f0 9e8c 90f0 8f8c 90f0 938c 90f0
217 360 220 214 236 360 220 214 217 360 220 214 223 360 220
0000060 8b8c 90f0 838c 0021
214 213 360 220 214 203 ! \0
0000067
If you wrap your head around the endianness issues (od -h prints
each pair of bytes as a 16-bit word, which comes out byte-swapped
here), it looks like perfectly okay UTF-8 with four bytes per character.
Of course the mileage on the Java implementation that you use
may vary.
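To check one character of the dump by hand, the following sketch encodes the first Old Italic letter and prints its bytes in stream order, so no byte swapping is needed when reading the result. The class name OldItalicBytes and the helper hex are hypothetical. The sketch also assumes a current JDK (StandardCharsets and String.codePointAt do not exist in JDK 1.2.2).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only. */
class OldItalicBytes {
    /** Formats bytes as space-separated lowercase hex, in stream order. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String h = "\ud800\udf07";  // one character, stored as a surrogate pair
        // Code point that the surrogate pair stands for.
        System.out.println(Integer.toHexString(h.codePointAt(0)));  // prints 10307
        // The four UTF-8 bytes for that single character.
        System.out.println(hex(h.getBytes(StandardCharsets.UTF_8)));  // prints f0 90 8c 87
    }
}
```

The bytes f0 90 8c 87 are the ones that follow the second space in the dump, where od shows them as the swapped words f020 8c90 f087.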
In summary, I think that all the arguments given above together very
clearly support the current wording on character sets and encodings.
I can see both sides of it. But at the moment saying UTF-8/16 seems like a
win on cost-benefit. -Tim
There is another famous IETF saying: "zero, one, many".
Using both UTF-8 and UTF-16 counts as two for me :-).
Regards, Martin.