From: Martin Duerst (duerst@w3.org)
Date: Sun Mar 09 2003 - 12:34:46 CST
At 13:53 03/03/08 -0500, Bruce Lilly wrote:
>Charles Lindsey wrote:
>>No. Language tags are composed from the letters A-Z and a-z, and the
>>digits 0-9 and minus/hyphen ('-'). There is no requirement that they be
>>expressed in ASCII. Indeed, to quote from RFC 3066:
>> Language tags may always be presented using the characters A-Z, a-z,
>> 0-9 and HYPHEN-MINUS, which are present in most character sets, so
>> presentation of language tags should not have any character set
>> issues.
>>And it just so happens that within the language-tagging part of
>>Unicode there are code-points for the letters A-Z and a-z, and the
>>digits 0-9 and minus/hyphen ('-'), so the RFC 3066 tags are easily
>>represented - indeed, the Unicode standard references RFC 3066 as the
>>expected tagging system to be used with its language tags.
>
>Utter nonsense. That says that *because* the tags are always ldh,
>there is no issue with representing them in any charset. Here's what
>3066 says in detail:
>
>2.1 Language tag syntax
>
> The language tag is composed of one or more parts: A primary language
> subtag and a (possibly empty) series of subsequent subtags.
>
> The syntax of this tag in ABNF [RFC 2234] is:
>
> Language-Tag = Primary-subtag *( "-" Subtag )
>
> Primary-subtag = 1*8ALPHA
>
> Subtag = 1*8(ALPHA / DIGIT)
>
> The productions ALPHA and DIGIT are imported from RFC 2234; they
> denote respectively the characters A to Z in upper or lower case and
> the digits from 0 to 9. The character "-" is HYPHEN-MINUS (ABNF:
> %x2D).
>
>RFC 2234's definitions of ALPHA and DIGIT are quite clear, and
>they are absolutely a subset of ASCII and not some oddball
>multibyte codes.
Please allow me to cite from http://www.ietf.org/rfc/rfc2234.txt:

   2.3 Terminal Values

   Rules resolve into a string of terminal values, sometimes called
   characters. In ABNF a character is merely a non-negative integer.
   In certain contexts a specific mapping (encoding) of values into a
   character set (such as ASCII) will be specified.

So neither RFC 2234 nor RFC 3066 says that language tags have
to be encoded in ASCII. Indeed, it would be a bad idea if they did.
As an example, in an HTML document with charset=UTF-16, language
codes obviously are encoded in UTF-16. The same goes for
EBCDIC-encoded documents.
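
To make the point concrete, here is a minimal sketch in Python. The
function names are mine, and cp037 is only one of several EBCDIC code
pages, chosen as an assumption for illustration. It checks a tag
against the RFC 3066 syntax as a sequence of abstract characters and
then serializes the same tag in three different charsets:

```python
import re

# RFC 3066, section 2.1:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_language_tag(tag: str) -> bool:
    """Check the tag as abstract characters, independent of any encoding."""
    return LANGUAGE_TAG.fullmatch(tag) is not None


def encode_tag(tag: str, charset: str) -> bytes:
    """Serialize a valid tag in whatever charset the document uses."""
    if not is_language_tag(tag):
        raise ValueError(f"not an RFC 3066 language tag: {tag!r}")
    return tag.encode(charset)


# Same tag, three different byte sequences.
for charset in ("ascii", "utf-16-be", "cp037"):
    print(charset, encode_tag("en-US", charset).hex())
```

The syntax check never looks at bytes; only the last step depends on
the charset, which is the distinction RFC 2234 draws in section 2.3.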
So far, we are still working with the repertoire of characters
in ASCII. To get to the Unicode plane 14 tag codes, we just map the
ASCII codes to some other codes, for our specific protocol. This is
done all too often. RFC 2047, with base64 and quoted-printable, is a
typical example.
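
As a rough sketch of that mapping (Python again; the function names
are hypothetical): the plane 14 tag characters U+E0020 to U+E007E
mirror ASCII 0x20 to 0x7E at a fixed offset of 0xE0000, and U+E0001
LANGUAGE TAG introduces the tag.

```python
LANGUAGE_TAG_INTRODUCER = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000                    # tag character = ASCII code + 0xE0000


def to_plane14(tag: str) -> str:
    """Map an ASCII language tag onto Unicode plane 14 tag characters."""
    for ch in tag:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"no tag character for {ch!r}")
    return LANGUAGE_TAG_INTRODUCER + "".join(
        chr(TAG_OFFSET + ord(ch)) for ch in tag
    )


def from_plane14(tagged: str) -> str:
    """Recover the ASCII tag from a plane 14 tag sequence."""
    if not tagged.startswith(LANGUAGE_TAG_INTRODUCER):
        raise ValueError("missing U+E0001 LANGUAGE TAG introducer")
    body = tagged[len(LANGUAGE_TAG_INTRODUCER):]
    for ch in body:
        if not 0xE0020 <= ord(ch) <= 0xE007E:
            raise ValueError(f"not a tag character: U+{ord(ch):04X}")
    return "".join(chr(ord(ch) - TAG_OFFSET) for ch in body)


print([f"U+{ord(ch):05X}" for ch in to_plane14("fr")])
print(from_plane14(to_plane14("fr-CA")))
```

The repertoire is unchanged; only the code points differ, which is the
same kind of re-mapping that RFC 2047 performs at the byte level.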
Not that I like the Unicode plane 14 tag codes, but then I don't
like RFC 2047 either. And I think the main issue is whether we need
language codes, not how to encode them. See below.
>>>Step 1 is not done (strictly speaking it's not a requirement for
>>>names, only for text strings).
>>
>>Indeed, if we consider I18N of newsgroups-names only, Step 1 is not even a
>>requirement.
>
>It remains a strong recommendation.
And common sense makes clear that it would be a recipe for disaster.
Consider the newsgroup chat.rec. Is that about chatting, or is it
about cats, in French? Do we want to somehow tag it on the side
to make the difference? Do you think you'll get the users to
understand what's going on?
Regards, Martin.