From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Mon Feb 24 2003 - 16:11:02 CST
In <3E58F9A8.5040700@Sonietta.blilly.com> Bruce Lilly <blilly@erols.com> writes:
>All valid utf-8 sequences begin with one octet in the range
>0xc0 - 0xdf and are followed by octets in the range 0x80 - 0xbf.
>The widely used iso-8859 charsets also include all of 0xc0 - 0xdf
>and half of the 0x80 - 0x9f range, viz 0xa0 - 0xbf. Therefore
>fully half of all valid utf-8 sequences are also valid iso-8859
>sequences.
But that is again the wrong question. What you need to ask is what
percentage of iso-8859 sequences are also valid utf-8 sequences.
If you take a random sequence of iso-8859 characters, ignoring the CTLs
(0x00-1f) and the extra CTLs (0x80-9f), then it is easy to calculate the
probability that it might be a valid UTF-8 sequence, according to the
length of the sequence. Assume that Unicode points above U+10ffff are
illegal, and forget about the absent surrogate cases.
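As an illustration of what "might be a valid UTF-8 sequence" means here, the following is a minimal sketch of a checker under the same simplifications. The function name plausible_utf8 is hypothetical, and the lead-octet ranges are an assumption chosen to match the counts 32, 16 and 8 that appear in the formula; it is not a full UTF-8 validator.

```python
def plausible_utf8(data: bytes) -> bool:
    """Return True if data parses as UTF-8 under the simplified model.

    Assumptions (simplifications, not full UTF-8 validation):
    - single octets must lie in 0x20-0x7f (no CTLs);
    - continuation octets must lie in 0xa0-0xbf, because 0x80-0x9f
      are excluded as extra CTLs;
    - lead octets 0xc0-0xdf, 0xe0-0xef and 0xf0-0xf7 start 2-, 3- and
      4-octet characters (32, 16 and 8 possible values);
    - overlong forms, surrogates and code point limits are not checked.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if 0x20 <= b <= 0x7F:
            extra = 0
        elif 0xC0 <= b <= 0xDF:
            extra = 1
        elif 0xE0 <= b <= 0xEF:
            extra = 2
        elif 0xF0 <= b <= 0xF7:
            extra = 3
        else:
            # Control octet or stray continuation octet.
            return False
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any(not (0xA0 <= c <= 0xBF) for c in tail):
            return False
        i += 1 + extra
    return True
```

For example, the octets 0xc3 0xa9 pass the check: they are two printable iso-8859-1 characters and also the UTF-8 encoding of U+00E9. A lone 0xe9 (the iso-8859-1 encoding of the same character) fails, because it is a lead octet with nothing following it.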
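If P(n) is the probability that a sequence of length n is valid UTF-8,
then, taking P(0) = 1 and P(n) = 0 for n < 0, the formula is
P(n) =  96/256           * P(n-1)     = .3750 * P(n-1)
     +  32*32/(256^2)    * P(n-2)     = .0156 * P(n-2)
     +  16*32^2/(256^3)  * P(n-3)     = .0010 * P(n-3)
     +  8*32^3/(256^4)   * P(n-4)     = .0001 * P(n-4)
The four terms correspond to the last character of the sequence
occupying 1, 2, 3 or 4 octets.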
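The recurrence is easy to evaluate mechanically. The sketch below does so with exact fractions, assuming P(0) = 1 and P(k) = 0 for k < 0; the names COEFFS and utf8_probabilities are illustrative only. Figures worked out by hand with rounded coefficients may differ from its output in the last digit.

```python
from fractions import Fraction

# Coefficients copied from the formula: the chance that the final
# character of a valid sequence occupies 1, 2, 3 or 4 octets.
COEFFS = [
    Fraction(96, 256),
    Fraction(32 * 32, 256**2),
    Fraction(16 * 32**2, 256**3),
    Fraction(8 * 32**3, 256**4),
]


def utf8_probabilities(max_len: int) -> list[Fraction]:
    """Return [P(0), ..., P(max_len)] for the recurrence above."""
    p = [Fraction(1)]  # P(0) = 1: the empty sequence is trivially valid
    for n in range(1, max_len + 1):
        total = Fraction(0)
        for k, coeff in enumerate(COEFFS, start=1):
            if n - k >= 0:  # P of a negative length is taken as 0
                total += coeff * p[n - k]
        p.append(total)
    return p


if __name__ == "__main__":
    for n, prob in enumerate(utf8_probabilities(10)):
        print(f"{n:2d}  {float(prob) * 100:6.2f}%")
```

Exact fractions are used so that no rounding error accumulates from one length to the next; only the final conversion to a percentage is rounded.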
The results are as follows:
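length   probability
of seq   of valid utf-8
   0     1                        = 100%
   1     .3750                    = 37.50%
   2     .1406+.0156              = 15.62%
   3     .0586+.0059+.0010        =  6.54%
   4     .0245+.0024+.0004+.0001  =  2.74%
   5     .0103+.0010+.0002        =  1.15%
   6     .0043+.0004+.0001        =  0.48%
   7     .0018+.0002              =  0.20%
   8     .0008+.0001              =  0.08%
   9     .0003                    =  0.04%
  10     .0001                    =  0.01%
(Each term is rounded to four places and terms that round to zero are
dropped, so the terms shown may not add up exactly to the percentage.)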
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5