From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Fri Jul 26 2002 - 05:40:24 CDT
In <3D3FE304.1090108@certplus.com> Jean-Marc Desperrier <jean-marc.desperrier@certplus.com> writes:
>Charles Lindsey wrote:
>>+ attempt to interpret the header according to whatever other character
>>+ set can be deduced, or has been configued as a default by the reader.
>>
>configured.
>>! NOTE: It is possible to determine, with a high degree of
>>! accuracy, when a given text containing octets with the 8th bit
>>! set was not encoded using UTF-8, and using this test to recover
>>! such non-compliant texts is therefore commended where no other
>>! harm could arise.
>>
>Detection that the text was not encoded as UTF-8 has 100% accuracy.
No, I think you have got it the wrong way around. If a text is correctly
encoded as UTF-8, then it will be compliant with the UTF-8 spec, and hence
the test will report it as valid UTF-8 100% of the time.
But if a text is encoded in something else (big5, for example) then there
is still a small probability that the test will report it as valid UTF-8,
but a much larger probability that it will be reported as not UTF-8.
Which is exactly what my NOTE says.
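
To make the asymmetry concrete, here is a minimal sketch of such a test in Python. It is not taken from the draft: the function names (is_valid_utf8, has_8bit_octets, classify) are hypothetical, and it assumes that "valid UTF-8" means "accepted by a strict UTF-8 decoder". The sample octets are my own illustrations.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets decode under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_8bit_octets(data: bytes) -> bool:
    """Return True if any octet has its 8th bit set."""
    return any(octet >= 0x80 for octet in data)


def classify(data: bytes) -> str:
    """Hypothetical helper: 'ascii', 'utf-8' or 'not-utf-8'."""
    if not has_8bit_octets(data):
        return "ascii"  # the test has nothing to say about pure ASCII
    return "utf-8" if is_valid_utf8(data) else "not-utf-8"


# Genuine UTF-8 always passes.
genuine = "caf\u00e9".encode("utf-8")

# Octets intended as Big5: 0xA4 cannot start a UTF-8 sequence, so the
# test rejects them, which is the usual outcome for non-UTF-8 text.
big5_sample = b"\xa4\xa4\xa4\xe5"

# The rare false positive: 0xC3 0xA9 lies in the Big5 double-byte range,
# yet it is also well-formed UTF-8, so the test cannot tell the two apart.
ambiguous = b"\xc3\xa9"

for sample in (genuine, big5_sample, ambiguous):
    print(sample, classify(sample))
```

A "utf-8" verdict is therefore only probable, whereas a "not-utf-8" verdict is certain, and the chance of a false "utf-8" shrinks quickly as the text gets longer.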
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5