Re: Transformation of Non-ASCII headers

From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Mon Feb 24 2003 - 16:11:02 CST


In <3E58F9A8.5040700@Sonietta.blilly.com> Bruce Lilly <blilly@erols.com> writes:

>All valid utf-8 sequences begin with one octet in the range
>0xc0 - 0xdf and are followed by octets in the range 0x80 - 0xbf.
>The widely used iso-8859 charsets also include all of 0xc0 - 0xdf
>and half of the 0x80 - 0xbf range, viz 0xa0 - 0xbf. Therefore
>fully half of all valid utf-8 sequences are also valid iso-8859
>sequences.

But that is again the wrong question. What you need to ask is what
percentage of iso-8859 sequences are also valid utf-8 sequences.

If you take a random sequence of iso-8859 characters, ignoring the CTLs
(0x00-1f) and the extra CTLs (0x80-9f), then it is easy to calculate the
probability that it might be a valid UTF-8 sequence, according to the
length of the sequence. Assume that Unicode points above U+10ffff are
illegal, and forget about the absent surrogate cases.

If P(n) is the probability that a sequence of length n is valid UTF-8,
then the formula is

P(n) =   96/256          * P(n-1)     [= .3750 * P(n-1)]
     +   32*32/(256^2)   * P(n-2)     [= .0156 * P(n-2)]
     +   16*32^2/(256^3) * P(n-3)     [= .0010 * P(n-3)]
     +   8*32^3/(256^4)  * P(n-4)     [= .0001 * P(n-4)]

with P(0) = 1 and P(k) = 0 for k < 0.
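
The recurrence can be evaluated mechanically rather than by hand. What
follows is only a minimal sketch in Python: the names COEFFS and
valid_probabilities are made up for illustration, the coefficients are
copied from the formula as written, and the starting values P(0) = 1 and
P(k) = 0 for k < 0 are assumed. Exact fractions are used so that no
rounding creeps in between steps.

```python
from fractions import Fraction

# Coefficients copied from the recurrence: the chance that the next
# 1, 2, 3 or 4 octets form one complete unit under the stated model.
COEFFS = [
    Fraction(96, 256),             # single octet
    Fraction(32 * 32, 256**2),     # lead octet + 1 continuation octet
    Fraction(16 * 32**2, 256**3),  # lead octet + 2 continuation octets
    Fraction(8 * 32**3, 256**4),   # lead octet + 3 continuation octets
]


def valid_probabilities(max_len):
    """Return [P(0), ..., P(max_len)].

    Assumes P(0) = 1 and treats P(k) as 0 for k < 0.
    """
    p = [Fraction(1)]
    for n in range(1, max_len + 1):
        total = Fraction(0)
        for k, coeff in enumerate(COEFFS, start=1):
            if n - k >= 0:  # terms reaching below P(0) contribute nothing
                total += coeff * p[n - k]
        p.append(total)
    return p


if __name__ == "__main__":
    for n, prob in enumerate(valid_probabilities(10)):
        print(f"{n:2d}  {float(prob):.4f}  {float(prob) * 100:6.2f}%")
```

Running it prints one line per length, giving the probability both as a
decimal and as a percentage.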

The results are as follows:

length    probability
of seq    of valid utf-8

 0        1                         = 100%
 1        .3750                     = 37.50%
 2        .1406+.0156               = 15.62%
 3        .0586+.0059+.0010         =  6.54%
 4        .0245+.0024+.0004+.0001   =  2.74%
 5        .0103+.0010+.0002         =  1.15%
 6        .0043+.0004+.0001         =  0.48%
 7        .0018+.0002               =  0.20%
 8        .0008+.0001               =  0.08%
 9        .0003                     =  0.04%
10        .0001                     =  0.01%

(Terms are rounded to four places and those that round to zero are
omitted; each percentage is taken from the unrounded sum.)
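
For very short lengths the figures can be cross-checked by brute force:
enumerate every octet string and count those that split cleanly into
units. This is a sketch only. The unit shapes below are my reading of
what the coefficients count (96 printable single octets; 32, 16 and 8
lead octets; 32 continuation octets in 0xa0-0xbf), and the names
fits_model and count_fitting are invented here. It is deliberately not a
real UTF-8 validator: it follows the simplified model, so it accepts
0xc0 and 0xc1 as lead octets and rejects continuation octets in
0x80-0x9f.

```python
from itertools import product


def fits_model(octets):
    """True if the octets split cleanly into the units of the model.

    Assumed unit shapes (an interpretation of the coefficients):
      0x20-0x7f                      one printable ASCII octet
      0xc0-0xdf + 1 of 0xa0-0xbf     two-octet unit
      0xe0-0xef + 2 of 0xa0-0xbf     three-octet unit
      0xf0-0xf7 + 3 of 0xa0-0xbf     four-octet unit
    """
    i = 0
    while i < len(octets):
        lead = octets[i]
        if 0x20 <= lead <= 0x7F:
            need = 0
        elif 0xC0 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF7:
            need = 3
        else:
            return False  # CTL, stray continuation octet, or 0xf8-0xff
        tail = octets[i + 1 : i + 1 + need]
        if len(tail) != need or any(not 0xA0 <= b <= 0xBF for b in tail):
            return False  # truncated unit or octet outside 0xa0-0xbf
        i += 1 + need
    return True


def count_fitting(length):
    """Count all octet strings of the given length that fit the model."""
    return sum(fits_model(seq) for seq in product(range(256), repeat=length))


if __name__ == "__main__":
    # Length 3 already means 256^3 strings, so stop at 2 here.
    for n in (0, 1, 2):
        hits = count_fitting(n)
        print(n, hits, hits / 256**n)
```

Dividing each count by 256^n gives the probability for that length
directly, with no recurrence involved.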

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

