Re: Yes, Rat's nest


From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Thu Jul 04 2002 - 06:24:54 CDT


In <Pine.BSI.3.91.1020702223419.496C-100000@spsystems.net> Henry Spencer <henry@spsystems.net> writes:

>The issue is that because there is more than one way to represent a given
>character in UTF-8 -- for example, you could represent "/" (U+002f) as
>0xc0,0xaf -- it is problematic to recognize dangerous metacharacters when
>they are encoded in UTF-8. (Newer definitions of UTF-8 often forbid such
>"overlong" sequences, older ones usually didn't.)

Indeed. The latest draft, which I expect to become an RFC fairly soon,
absolutely forbids both generating and recognizing overlong
sequences. Our own draft says the same: trying to interpret them as UTF-8
is a MUST NOT. However, we are in the process of changing that to say that
you MAY try to interpret them as some totally different non-UTF-8
character set, if you can figure out which one it might be.
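
To make the 0xc0,0xaf example concrete, here is a minimal Python sketch
contrasting a lenient decoder with a strict one. The helper names
(naive_decode_2byte, strict_decode) are hypothetical and purely for
illustration; the strict path just relies on Python's built-in "utf-8"
codec, which rejects overlong forms.

```python
def naive_decode_2byte(b0, b1):
    """Decode a two-octet sequence the way a lenient decoder might:
    strip the marker bits and combine, with no overlong check."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def strict_decode(octets):
    """Return the decoded string, or None if the octets are not valid
    UTF-8 (Python's codec treats overlong sequences as invalid)."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The lenient path quietly turns the overlong pair into "/" (U+002F).
lenient = naive_decode_2byte(0xC0, 0xAF)

# The strict path refuses the overlong pair but accepts the one-octet form.
rejected = strict_decode(b"\xc0\xaf")
accepted = strict_decode(b"\x2f")
```

A filter that scans raw octets for 0x2f would miss the overlong pair,
yet the lenient decoder downstream would still produce "/". Refusing to
recognize the sequence at all closes that gap.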

>So, of course, the fix that everyone is enthused about is neither the
>obvious one nor the best one: forbid overlong sequences!

I don't see why that is not "the best". One other benefit is that you can
check strings for equality simply by comparing octets (assuming all the
other normalizations have been done too, of course).
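
A small Python sketch of the octet-equality point, including the caveat
about other normalizations. It assumes NFC as the normalization in
question, which is my choice for illustration and not something the
drafts are quoted as specifying here; the function name same_string is
likewise hypothetical.

```python
import unicodedata


def same_string(a_octets, b_octets):
    """With overlong forms forbidden, each code point has exactly one
    UTF-8 encoding, so comparing octets compares code points."""
    return a_octets == b_octets


# "e" with an acute accent, precomposed (U+00E9) versus
# decomposed ("e" followed by U+0301 COMBINING ACUTE ACCENT).
precomposed = "\u00e9"
decomposed = "e\u0301"

# Without normalization the octets differ, although the text looks the same.
raw_equal = same_string(precomposed.encode("utf-8"),
                        decomposed.encode("utf-8"))

# After normalizing both to NFC, plain octet comparison is enough.
nfc_equal = same_string(
    unicodedata.normalize("NFC", precomposed).encode("utf-8"),
    unicodedata.normalize("NFC", decomposed).encode("utf-8"),
)
```

Here raw_equal is False and nfc_equal is True: forbidding overlong
sequences removes the encoding-level ambiguity, and normalization deals
with the character-level one.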

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5



