Re: Language Tagging within Unicode

From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Tue Feb 04 2003 - 05:46:18 CST


In <3E3EC4E0.7080801@Sonietta.blilly.com> Bruce Lilly <blilly@erols.com> writes:

>Charles Lindsey wrote:
>>
>> (However, there is no defined list of those "special protocols".)

>There is; one of the Unicode documents related to the introduction
>specifically mentions ACAP as the intended use, ... .
>The full document is: http://www.unicode.org/unicode/reports/tr20/index.html
>Martin is listed as one of the co-authors. The relevant section is

Martin himself has given his opinion on this list, and it flatly
contradicts yours.

>"
>3.10 Language Tag Characters, U+E0000 .. U+E007F

>.... They were solely
>included for the benefit of those Internet protocols, such as ACAP, which
                                                       ^^^^^^^^^^^^
>require a standard mechanism for marking language in UTF-8 strings ... .

How can you interpret "such as ACAP" as providing a 'defined list of those
"special protocols"', let alone as specifying that no other protocol is
eligible? As Martin has said, the feature is available for those internet
protocols whose defining documents care to use it. That may or may not
turn out to include Usefor.

>In this case proper recognition requires that all receivers use an
>implementation of Unicode 3.1 or 3.2; earlier versions definitely
>won't work, and later versions might not work.

Wow! So Unicode 3.1 is not forwards compatible with Unicode 3.0! Why ever
were they allowed to introduce a new feature that would break existing
software? The IETF would never have allowed that, would they?

> Maintaining the
>tags not only requires a Unicode editor, it requires an editor which
>properly handles the 3.1/3.2 language tags when modifying text -- I
>am not aware of any such editor.

You don't need a Unicode Editor to display Unicode (with or without
language tags). If the feature to use these tags is present, and users
want to take advantage of that feature, then they will find ways to
generate them.
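
To illustrate the display side: a receiver that knows nothing about the
tags could simply drop them before rendering. This is only a minimal
Python sketch, assuming the U+E0000 .. U+E007F range quoted from TR20
above; the name strip_language_tags is made up for illustration and is
not taken from any standard or library.

```python
# Tag characters occupy U+E0000 .. U+E007F (range quoted from TR20 section 3.10).
TAG_BLOCK = range(0xE0000, 0xE0080)

def strip_language_tags(text: str) -> str:
    """Return text with every character in the tag block removed."""
    return "".join(ch for ch in text if ord(ch) not in TAG_BLOCK)

# U+E0001 LANGUAGE TAG, then tag-'e' and tag-'n', then the visible text.
tagged = "\U000E0001\U000E0065\U000E006EHello"
print(strip_language_tags(tagged))
```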

>> OTOH, within the Email community, I find the following within RFC 2231,
>> when introducing a language-tag extension to RFC 2047:
>>
>> In the future it is likely that some character sets will provide
>> facilities for inline language labeling. Such facilities are
>> inherently more flexible than those defined here as they allow for
>> language switching in the middle of a string.

>In the sense of character sets as used in RFC 2231, there is no language
>labeling provided by any character set; Unicode 3.1 is not a character
>set in that sense -- utf-7 is, but utf-7 (nor any other charset) does
>not provide language labeling.

Eh? UTF-7, UTF-8 and UTF-16 are all charsets, and they _all_ provide
language tagging via the mechanism provided by Unicode 3.1. That is
_exactly_ the state of affairs foreseen by RFC 2231.
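
As a rough illustration of that point, the same tagged string can be
carried in each of the three charsets. The Python sketch below assumes
the Unicode 3.1 layout (U+E0001 LANGUAGE TAG followed by tag characters
that mirror printable ASCII at an offset of 0xE0000); the helper name
make_language_tag is hypothetical.

```python
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset

def make_language_tag(lang: str) -> str:
    """Spell a language identifier such as 'en' in tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language identifier must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)

tagged = make_language_tag("en") + "Hello"

# The tag characters are ordinary code points, so each charset carries them.
for codec in ("utf-7", "utf-8", "utf-16"):
    encoded = tagged.encode(codec)
    assert encoded.decode(codec) == tagged
    print(codec, len(encoded), "bytes")
```

Nothing charset-specific is needed here: the tags travel as ordinary
code points, so any encoding of Unicode carries them.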

>> If and when such facilities are developed they SHOULD be used in
>> preference to the language labeling facilities specified here. Note
>> that all the mechanisms defined here allow for the omission of
>> language labels so as to be able to accommodate this possible future
>> usage.

>As noted, it hasn't happened yet.

So? I rather doubt whether the language tagging for parameters defined by
RFC 2231 has actually happened yet. Have you ever seen an example in the
wild?

>> Effectively, one would say "It will usually be unnecessary to use language
>> tagging within headers but, if it is considered necessary, then the
>> language tagging defined for Unicode MAY be used" (note that the
>> contexts where this would be applicable are all for human consumption).

>That is not applicable in message or MIME-part header fields and
>would not comply with the requirements of RFC 2277 for those
>fields.

But you seem to be agreeing that it would be applicable in top-level
headers. As to whether it would be applicable in part header fields, that
depends on what happens to part level fields in Netnews, which is an issue
yet to be addressed.

> Because the Unicode documents expressly forbid use of the
>language tags with MIME, they would be inappropriate for body text
>in a MIME message, and there is no way to get body content other
>than plain text in US-ASCII with a 7bit transfer encoding except
>via MIME.

And yes, that is true for bodies, but we were discussing headers, in case
you had forgotten.

>> So is that an allowable usage according to Unicode 3.2? Observe that
>> headers using raw UTF-8 will not be using any MIME protocol.

>By definition, no header fields may use raw UTF-8.

Eh? We are discussing the situation that would arise IF usage of raw UTF-8
in header fields WERE to be defined. So you cannot argue against it on
those grounds, otherwise your argument is circular.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

