From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Thu Jul 04 2002 - 06:24:54 CDT
In <Pine.BSI.3.91.1020702223419.496C-100000@spsystems.net> Henry Spencer <henry@spsystems.net> writes:
>The issue is that because there is more than one way to represent a given
>character in UTF-8 -- for example, you could represent "/" (U+002f) as
>0xc0,0xaf -- it is problematic to recognize dangerous metacharacters when
>they are encoded in UTF-8. (Newer definitions of UTF-8 often forbid such
>"overlong" sequences, older ones usually didn't.)
Indeed. The latest draft, which I expect to become an RFC fairly soon,
absolutely forbids either Generating or Recognizing the overlong
sequences. Our own draft is the same - trying to interpret them as UTF-8
is a MUST NOT. However, we are in the process of changing that to say that
you MAY try to interpret them as some totally different non-UTF-8
character set, if you can figure out which one it might be.
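To make the "overlong" rule concrete, here is a minimal sketch of a check for a single UTF-8-shaped sequence. The function name is_overlong and the MIN_CODE_POINT table are hypothetical names chosen for illustration; they do not come from either draft. The sketch tests only for overlong forms, not for surrogates or values above U+10FFFF.

```python
# Smallest code point that legitimately needs each sequence length.
# (Hypothetical helper table, for illustration only.)
MIN_CODE_POINT = {1: 0x0000, 2: 0x0080, 3: 0x0800, 4: 0x10000}


def is_overlong(seq: bytes) -> bool:
    """Return True if seq is one UTF-8-shaped sequence that uses more
    octets than its code point requires."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    # Work out the sequence length and the payload bits of the lead octet.
    if lead < 0x80:
        length, cp = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("not a UTF-8 lead octet")
    if len(seq) != length:
        raise ValueError("wrong number of octets for this lead octet")
    # Each continuation octet must look like 10xxxxxx and adds 6 bits.
    for octet in seq[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("bad continuation octet")
        cp = (cp << 6) | (octet & 0x3F)
    # Overlong: the value would have fitted in a shorter sequence.
    return cp < MIN_CODE_POINT[length]


print(is_overlong(b"\xc0\xaf"))  # the 0xc0,0xaf form of "/" quoted above
print(is_overlong(b"/"))         # the one-octet form of "/"
```

For comparison, Python 3's built-in decoder already behaves as a MUST NOT implementation would: b"\xc0\xaf".decode("utf-8") raises UnicodeDecodeError rather than yielding "/".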
>So, of course, the fix that everyone is enthused about is neither the
>obvious one nor the best one: forbid overlong sequences!
I don't see why that is not "the best". One other benefit is that you can
simply compare octets to check for equality of strings (assuming
all the other normalizations are done also, of course).
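The octet-comparison point can be sketched as follows. This assumes that "the other normalizations" means a Unicode normalization form such as NFC, which is an assumption made for this example and not something stated above; canonical_octets is a hypothetical helper name.

```python
import unicodedata


def canonical_octets(text: str) -> bytes:
    """Normalize (NFC is assumed here) and encode as strict UTF-8.

    With overlong sequences forbidden, each code point has exactly one
    encoding, so equal normalized strings yield identical octets.
    """
    return unicodedata.normalize("NFC", text).encode("utf-8")


precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the octets differ even though the text looks the same.
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))

# After normalization a plain octet comparison is enough.
print(canonical_octets(precomposed) == canonical_octets(decomposed))
```

If overlong forms were accepted, "/" and 0xc0,0xaf would decode to the same character yet compare unequal as octets, which is the case the ban removes.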
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131     Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5