
Re: RFC: UTF-8 iCalendar bug solution




Every use of the 'NON-US-ASCII' ABNF definition in RFC 2445 allows for zero or MORE octets. It looks to me as if that already allows for characters of one or more octets (multibyte characters).

Assuming you are correct and it should be %x80-FF, it looks to me
as if that change alone would solve the problem. Have you verified
that all of the valid UTF-8 octet sequences that can include 0xF9-FF are
printable (not control) characters? If not, we would also have to exclude
those character sequences.
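
As a rough check of the question above, here is a minimal Python sketch that
enumerates the octets at or above 0x80 that a UTF-8 encoder can emit. Note the
assumptions: Python's codec follows the later RFC 3629 definition, which stops
at U+10FFFF, whereas RFC 2279 (the one RFC 2445 cites) also allowed 5- and
6-octet sequences with lead octets 0xF8-0xFD. In neither definition do 0xFE or
0xFF appear. The helper name utf8_high_octets is made up for illustration.

```python
import unicodedata

def utf8_high_octets():
    """Collect every octet >= 0x80 that Python's UTF-8 encoder can emit."""
    seen = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be encoded as UTF-8
            continue
        seen.update(chr(cp).encode("utf-8"))
    return seen

octets = utf8_high_octets()
# Octets in 0x80-0xFF that never occur in RFC 3629 style UTF-8 output.
never_used = sorted(set(range(0x80, 0x100)) - octets)
print([hex(b) for b in never_used])

# A C1 control character (U+0085, NEXT LINE) still encodes to octets that
# fall inside the NON-US-ASCII range, so the range alone does not exclude
# control characters.
nel = "\u0085"
print(unicodedata.category(nel), nel.encode("utf-8").hex())
```

The second check suggests that widening or narrowing the octet range does not
by itself answer the "printable" question, because control characters such as
U+0080-U+009F are encoded with ordinary lead and continuation octets.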

Changing from UTF-8 to UTF-16 (which I think is your proposal) would
break existing iMIP implementations.

If you are talking about the standard multibyte-to-wide-character
translation functions, you are talking about changing the charset
definition from one in which ASCII is valid to one in which it is not.
Although numerically 'A' == 0x41 == 0x0041, in the UTF-8 charset the
sequence 0x0041 != 'A'. Existing implementations that use standard
string functions would break. If an existing implementation already
used the standard multibyte-to-wide-character conversion functions
internally, then when it converted to wide characters (not knowing the
data was already wide, as you propose) and got the sequence 0x0041,
it would stop converting at the 0x00 octet and never see the 'A'
character. Implementations that use multibyte-aware (non-wide-char)
string functions would expect UTF-8 and would likewise stop at the
0x00 octet, never seeing the 'A'.
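
The 0x00 problem described above can be sketched in a few lines of Python.
This is only an illustration: c_str is a hypothetical helper that emulates how
a C function working on NUL-terminated strings would read a buffer, and it is
not taken from any iCalendar implementation.

```python
def c_str(buf: bytes) -> bytes:
    """Emulate a NUL-terminated read: keep the bytes before the first 0x00."""
    return buf.split(b"\x00", 1)[0]

utf8_a = "A".encode("utf-8")         # one octet: 0x41
utf16be_a = "A".encode("utf-16-be")  # two octets: 0x00 0x41
utf16le_ab = "AB".encode("utf-16-le")  # four octets: 0x41 0x00 0x42 0x00

print(c_str(utf8_a))      # the UTF-8 form survives intact
print(c_str(utf16be_a))   # big-endian: the read stops before the 'A'
print(c_str(utf16le_ab))  # little-endian: the read stops after the first char
```

Either byte order truncates the text for a consumer that expects UTF-8, which
is the breakage for existing implementations that the paragraph above warns
about.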

Mark Swanson wrote:


#1 in more detail:


(Quote from the spec as a reminder that UTF-8 is the default charset)
<quote>
4.1.4 Character Set
There is not a property parameter to declare the character set used
in a property value. The default character set for an iCalendar
object is UTF-8 as defined in [RFC 2279].
</quote>

Note that the definition of NON-US-ASCII is %x80-F8. This excludes the
following characters:

...

A quick fix might be to change NON-US-ASCII to "%x80-FF" but since this only
deals with a single character and does not handle every 2+ byte UTF-8
character this isn't good enough. A unified solution is presented at
the bottom.

Every ABNF rule that uses the NON-US-ASCII definition uses the *<Name> form, which allows for ZERO or more octets.
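
To illustrate the point about repetition, here is a small Python sketch that
treats a run of NON-US-ASCII octets as the byte pattern [\x80-\xf8]* and tests
UTF-8 encoded characters against it. The pattern and the function name
matches_non_us_ascii_run are assumptions made for this example; the real
RFC 2445 rules mix NON-US-ASCII with other alternatives.

```python
import re

# Zero or more octets, each in the range %x80-F8 (the NON-US-ASCII range
# quoted in this thread), applied to raw bytes rather than characters.
NON_US_ASCII_RUN = re.compile(rb"[\x80-\xf8]*")

def matches_non_us_ascii_run(text: str) -> bool:
    """Return True if every UTF-8 octet of text lies in %x80-F8."""
    return NON_US_ASCII_RUN.fullmatch(text.encode("utf-8")) is not None

print(matches_non_us_ascii_run("\u00e9"))  # e-acute: octets 0xC3 0xA9
print(matches_non_us_ascii_run("\u20ac"))  # euro sign: octets 0xE2 0x82 0xAC
print(matches_non_us_ascii_run("A"))       # plain ASCII is outside the range
```

Because the rule repeats per octet, a multibyte character matches as several
consecutive NON-US-ASCII octets rather than as a single unit.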


#2 in more detail


Note the following definition (and comment):



Change all 8-bit byte values to 16-bit byte values like this:


NON-US-ASCII = \u0080 - \uffff

Which is UTF-16 - correct?


--

 Doug Royer                     |   http://INET-Consulting.com
 -------------------------------|-----------------------------
 Doug@xxxxxxxxx                 | Office: (208)612-INET
 http://Royer.com/People/Doug   |    Fax: (866)594-8574
                                |   Cell: (208)520-4044

We Do Standards - You Need Standards
