From: Kai Henningsen (kaih@khms.westfalen.de)
Date: Sun Jan 30 2000 - 06:01:00 CST
chl@clw.cs.man.ac.uk (Charles Lindsey) wrote on 28.01.00 in <Fp1HDt.3JD@clw.cs.man.ac.uk>:
> But I don't agree that an agent CAN easily detect it. In section 2, we
> have the grammar
>
> UTF8-xtra-head = %d192-255
> UTF8-xtra-tail = %d128-191
> UTF8-xtra-char = UTF8-xtra-head 1*UTF8-xtra-tail
> to which you can add
> ASCII-char = %d00-127
>
> Huge quantities of Latin 1 and other character sets would pass that test,
I seriously doubt that assertion.
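
To make the doubt concrete, here is a rough sketch (Python, my own
illustration, not part of the draft) that turns the quoted grammar into a
byte-level regular expression and runs a few Latin 1 words through it. The
names GRAMMAR and passes_grammar are made up for this example.

```python
import re

# The quoted grammar as a byte regex: a text passes if it consists only of
# ASCII-char (0-127) or UTF8-xtra-char (one head byte 192-255 followed by
# one or more tail bytes 128-191).
GRAMMAR = re.compile(rb"(?:[\x00-\x7f]|[\xc0-\xff][\x80-\xbf]+)*")

def passes_grammar(data: bytes) -> bool:
    """True if the whole byte string matches the quoted grammar."""
    return GRAMMAR.fullmatch(data) is not None

# Ordinary Latin 1 words: each accented letter is a lone "head" byte that
# is followed by an ASCII letter (or by nothing), so the test fails.
for word in ["Müller", "café", "Straße", "naïve"]:
    print(word, passes_grammar(word.encode("latin-1")))   # False each time

# The same word encoded as UTF-8 passes.
print(passes_grammar("Müller".encode("utf-8")))           # True

# Latin 1 only passes when an accented capital happens to be followed by
# a character from 0x80-0xBF, e.g. the two characters "Ã©".
print(passes_grammar("Ã©".encode("latin-1")))             # True
```

Latin 1 text slips through only when every byte in 192-255 is directly
followed by a byte in 128-191, which is rare in natural-language text.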
> NOTE: There are a few sequences of octets (e.g. %d255 followed by %d255)
> which cannot legitimately occur in UTF-8. These SHOULD NOT be generated by
The bytes 0xFE and 0xFF are never legal in UTF-8.
The bytes 0xF5 to 0xFD are never legal within the range reachable with
Unicode (0x00000000 - 0x0010FFFF), and the relevant committees have
promised never to assign codes outside that range.
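
As a quick illustration of how much this alone catches, a minimal sketch
(Python; the function name never_legal_bytes is hypothetical) that reports
every byte from 0xF5 upward, assuming text restricted to 0x00000000 -
0x0010FFFF:

```python
def never_legal_bytes(data: bytes) -> list:
    """Offsets of bytes 0xF5-0xFF, which cannot occur in UTF-8 that is
    limited to code points up to 0x0010FFFF."""
    return [i for i, b in enumerate(data) if b >= 0xF5]

# The highest reachable code point starts with 0xF4, so 0xF5 and above
# are never needed as a first byte.
print(chr(0x10FFFF).encode("utf-8").hex())             # f48fbfbf

# In Latin 1, the letter "ü" is the single byte 0xFC, so a word
# containing it is flagged at once.
print(never_legal_bytes("Müller".encode("latin-1")))   # [1]
print(never_legal_bytes("Müller".encode("utf-8")))     # []
```

In Latin 1 the range 0xF5-0xFF holds common lowercase letters such as
"ö" (0xF6) and "ü" (0xFC), so a lot of German text, for one, fails this
check on the first such letter.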
And of course, only the following byte pairs are ever legal in UTF-8:
first byte    next byte
0x00-0x7F     0x00-0x7F, 0xC0-0xFD
0x80-0xBF     0x00-0xFD
0xC0-0xFD     0x80-0xBF
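
The pair rule can be sketched as a scan over adjacent bytes (Python; the
names kind, ALLOWED_NEXT and first_bad_pair are invented for this
example). This is a necessary check only: it does not count how many
tail bytes a given lead byte requires, and it cannot notice a lead byte
at the very end of the data, since that byte has no successor.

```python
def kind(b: int) -> str:
    """Classify a byte using the ranges from the pair list."""
    if b <= 0x7F:
        return "ascii"
    if b <= 0xBF:
        return "tail"
    if b <= 0xFD:
        return "lead"
    return "illegal"          # 0xFE and 0xFF

# Which kinds of byte may follow each kind of byte.
ALLOWED_NEXT = {
    "ascii": {"ascii", "lead"},
    "tail": {"ascii", "tail", "lead"},
    "lead": {"tail"},
}

def first_bad_pair(data: bytes):
    """Offset of the first illegal adjacent pair, or None if all pass."""
    for i in range(len(data) - 1):
        a, b = kind(data[i]), kind(data[i + 1])
        if a == "illegal" or b not in ALLOWED_NEXT[a]:
            return i
    return None

print(first_bad_pair("Müller".encode("latin-1")))   # 1 (0xFC then "l")
print(first_bad_pair("Müller".encode("utf-8")))     # None
```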
Regards, Kai