Re: Unicode and draft 07

From: Clive D.W. Feather (clive@demon.net)
Date: Wed May 29 2002 - 05:15:54 CDT


Charles Lindsey said:
> Some of the following is off-list email between Clive and myself.

It was? That was unintentional.

>>> Concerning the first NOTE in 4.4.1
>> However, can we change the start to just:
>>
>> UTF-8 is an encoding for the Unicode character set with ...
>>
>> or
>>
>> UTF-8 is an encoding for large character sets with ...
>>
>> since it's in no way limited to, or based on, 16 bits.
>
> Eh? That's why the NOTE in 4.4.1 includes the words "(and even 32 bit)".

Yes, but someone reading it will get the impression that 32 bits is an
add-on. It isn't: UTF-8 is designed for the entire 31 bit space of Unicode.
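
To illustrate the range, here is a minimal Python sketch of the one-to-six
byte scheme as laid out in RFC 2279, which reaches every value up to
0x7FFFFFFF. The function name is made up for the example, and the sketch
makes no attempt to reject surrogates or other values that ought not to be
encoded.

```python
def utf8_encode_31bit(cp: int) -> bytes:
    """Encode one code point with the 1-to-6 byte UTF-8 scheme of RFC 2279."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # ASCII is passed through unchanged
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if cp < limit:
            break
    # The lead byte carries the top bits; each continuation byte carries 6 more.
    out = [lead | (cp >> (6 * ncont))]
    for shift in range(6 * (ncont - 1), -1, -6):
        out.append(0x80 | ((cp >> shift) & 0x3F))
    return bytes(out)


print(utf8_encode_31bit(0x20AC).hex())      # three bytes, inside the 16-bit range
print(utf8_encode_31bit(0x10000).hex())     # four bytes, first value above U+FFFF
print(utf8_encode_31bit(0x7FFFFFFF).hex())  # six bytes, top of the 31-bit space
```

Nothing in the scheme treats 16 bits as a boundary; the same rule simply
adds continuation bytes as the value grows.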

> The new draft for UTF-8 claims it is suitable for UCS-2 and UCS-4,
> though I doubt anyone in the real world uses UCS-4.

Not an argument I want to get into. Especially since the characters above
U+FFFF are relatively new.

> Anyway, I now propose:
>
> NOTE: UTF-8 is an encoding of the 16bit UCS-2 (and even the 32bit
> UCS-4) character sets ...

I really think that's too much detail for our document. We don't care about
UCS-2 or UCS-4 (which aren't Unicode terms anyway). We mostly talk about
just Unicode, so I continue to think that:

    UTF-8 is an encoding for the Unicode character set with ...

is far clearer.

>>> Even the semantics I have written
>>> for 2.4.2 seems OK. You MUST NOT generate the illegal stuff, you MUST
>>> NOT interpret it as having any valid meaning, but you MAY pass it on
>>> if you can't be bothered to check it (as relayers certainly won't with
>>> stuff that they are not interested in).
>> Don't you need to add "or replaced by other characters" at the end of that
>> paragraph ?
>
> Why? Perhaps it should say they MUST NOT be decoded, and perhaps "valid"
> should be changed to "meaningful". So I now have:
>
> ...but they MUST NOT ever be decoded or otherwise interpreted as
> meaningful characters.

Okay.

In a recent discussion I suggested that a bad UTF-8 sequence could be
replaced by U+FFFD. You, I think, rejected that idea, so I was trying to
rule it out. However, it's actually *suggested* by the Unicode specs:
<http://www.unicode.org/unicode/faq/utf_bom.html#15>
so your wording is right, I think.
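
For comparison, the two policies (reject the bad sequence, or substitute
U+FFFD) look like this in a short Python sketch. The helper names and the
sample bytes are invented for the illustration; only the standard
bytes.decode() error handlers are used.

```python
def decode_strict(data: bytes) -> str:
    """Refuse to interpret malformed UTF-8 at all."""
    return data.decode("utf-8", errors="strict")


def decode_replacing(data: bytes) -> str:
    """Substitute U+FFFD for each malformed sequence."""
    return data.decode("utf-8", errors="replace")


bad = b"comp.\xff.misc"  # 0xFF can never appear in UTF-8

try:
    decode_strict(bad)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

print(ascii(decode_replacing(bad)))  # the bad byte shows up as \ufffd
```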

>>> So our definitions of combiner-extended and combiner-mark are
>>> essentially wrong.
>> After a lot of study of the Unicode standard, I think I agree.

> Yes, I spent best part of a day trying to find where we had got that
> syntax from, and I eventually tracked it down in Unicode TR 15, which
> deals with normalization. Essentially, our present syntax follows the
> features necessary for the understanding of the two normalizations NFC
> and NFKC.

This is, indeed, where we got the ideas from - we were trying to ensure
that each name can only occur in one encoding. However, this is dealt with
by the NFKC-invariance requirement and *not* by the syntax.
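
As a rough illustration of what NFKC-invariance means in practice, a name
passes if normalizing it changes nothing. The sketch below uses Python's
standard unicodedata module; the function name and the sample names are
made up for the example.

```python
import unicodedata


def is_nfkc_invariant(name: str) -> bool:
    """True if NFKC normalization leaves the name unchanged."""
    return unicodedata.normalize("NFKC", name) == name


print(is_nfkc_invariant("comp.caf\u00e9"))   # precomposed e-acute
print(is_nfkc_invariant("comp.cafe\u0301"))  # e followed by combining acute
print(is_nfkc_invariant("comp.\ufb01sh"))    # "fi" ligature, a compatibility character
```

Only the first name is invariant, so only that spelling could be used;
the syntax plays no part in the decision.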

Delving into my email archive (fascinating in itself) I find that this bit
of the syntax was introduced to ensure that you didn't put accents on the
dots between components. At one point I proposed the words:

    Names are restricted to those that are invariant under Unicode
    normalization NFC; each component must furthermore begin with
    a character with a combining class of 0.

Then you pointed out that this sort of thing was better done in syntax than
in semantic wording. I agreed.

>>> What we should have said is that a combiner-extended
>>> is everything excluding Cc, Cf, Cs, Zs, and Zp (as at present) plus the
>>> Marks Mn, Mc and Me (no need to mention combining classes).
>>> Then, our definition of combiner-mark should have been just the Marks
>>> Mn, Mc and Me (whatever their combining class).
>> Right, except that the names are now wrong.
>>> So no component glyph can start with a Mark (I think that is a
>>> fundamental Unicode property).
>> Almost. There is a definition (D17a) for a defective sequence that does so.
> Eh? I don't see that. D17A is right in the middle of the Hangul range.

No, not U+D17A, but Unicode Definition 17a, page 43 (chapter 3) of the
Unicode standard (only available in PDF).

>>> Marks are reserved for modifying the
>>> things preceding them

>> I think I agree. The relevant text in 5.5 should change to:
>>
>>     component-glyph   = combiner-base *combiner-mark
>>     combiner-base     = combiner-ASCII / combiner-extended
>>     combiner-ASCII    = DIGIT / ALPHA / "+" / "-" / "_"
>>     combiner-extended = <any character with a Unicode code value of
>> |                        0080 or greater,
>>                          but excluding any character in Unicode
>> |                        categories Cc, Cf, Cs, M, and Z>
>>     combiner-mark     = <any character with a Unicode code value of
>> |                        0080 or greater and in Unicode category M>
>>
>> and in the following note change "(Zs, Zl, Zp)" to "(Z)".
>>
>> I've deliberately used M and Z instead of listing the minor categories,
>> just in case anything new gets added.
>
> Right. That is one possibility. Its effect would be to exclude certain
> currently allowed component glyphs that began with a Mark (i.e.
> something that affected the following base-character instead of the
> previous one).

Right. More precisely, it includes marks as part of the component glyph
they follow; this affects the interpretation of the length limits further
down the section.
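
The proposed rules quoted above can be sketched in Python roughly as
follows. This is only an illustration: the function names are invented, it
assumes a component is nothing more than one or more component-glyphs, and
the categories come from whatever Unicode version the local unicodedata
module carries.

```python
import string
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
ASCII_STARTERS = set(string.digits + string.ascii_letters + "+-_")


def is_combiner_base(ch: str) -> bool:
    if ord(ch) < 0x80:
        return ch in ASCII_STARTERS
    cat = unicodedata.category(ch)
    # combiner-extended: exclude Cc, Cf, Cs and all of M and Z
    return cat not in ("Cc", "Cf", "Cs") and cat[0] not in "MZ"


def is_combiner_mark(ch: str) -> bool:
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")


def parse_component_glyphs(text: str):
    """Split text into component-glyphs, or return None if it does not match."""
    glyphs = []
    for ch in text:
        if is_combiner_base(ch):
            glyphs.append(ch)   # a base character starts a new glyph
        elif is_combiner_mark(ch) and glyphs:
            glyphs[-1] += ch    # a mark joins the glyph it follows
        else:
            return None         # leading mark, separator, control, ...
    return glyphs or None


glyphs = parse_component_glyphs("q\u0302x")
print(len("q\u0302x"), len(glyphs))       # 3 characters but 2 component-glyphs
print(parse_component_glyphs("\u0302q"))  # None: a mark cannot come first
```

The first line of output is the point about length limits: a limit counted
in component-glyphs is not the same as one counted in characters.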

> According to our understanding, such beasts are not
> supposed to exist, so in fact there should be no change.

Right.

> However, there is a less dangerous solution

I don't see why it's "less dangerous" or a "solution".

> which I have adopted for
> now, and that is simply to rename those syntax rules so that it is
> clear they derive from the Normalization requirements, and to add an
> explanatory note. So I propose to rename
>
> combiner-base as glyph-starter
> combiner-mark as glyph-marker
> combiner-ASCII as ASCII-starter
> combiner-extended as extended-starter

I have no problem with this. Though my reading suggests that a global
change from "glyph" to "grapheme" or "grapheme-cluster" (though that's a
bit long) might be a good idea.

> and to leave the syntax rules otherwise as they stand.
>
> Then the following NOTE would be changed to:
>
> NOTE: a component-glyph (a glyph-starter followed by some number
> of glyph-markers) is the basic unit upon which the Unicode
> normalizations NFC and NFKC operate; it is, to all intents and
> purposes, what a user might regard as a single "character" as
> displayed on his screen, though it might be transmitted as several
> actual characters (e.g. q-circumflex is two characters). Note also
> that, in some writing schemes, several component-glyphs will merge
> into one visible object of variable size.

The problem is that, in many cases, other markers are part of the single
glyph. One issue we discussed last year was the enclosing circle. We
accepted that there were reasons why a group name might use it. It's a
Mark, so we don't want it at the start of a component, and (to a first
approximation) it doesn't alter the amount of screen space the character
takes up; yet its combining class is zero.

The syntax changes were done while we were still understanding
normalisation and character construction. I really suggest (re-)reading
items D13 to D17a on page 43 of the Unicode Standard. It's clear that a
basic grapheme (a "combining character sequence") is a character not of
category M (a "base character") followed by zero or more characters of
category M ("combining characters"). Therefore the correct test is
category M/not-M, not class 0/not-0.
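
The enclosing circle shows the difference between the two tests. In the
Python sketch below (the helper names are invented for the illustration),
a component that starts with U+20DD passes a "combining class 0" test but
fails a "not category M" test, whereas an ordinary accent such as U+0302
fails both.

```python
import unicodedata


def starts_ok_by_class(component: str) -> bool:
    """Test on combining class: first character must have class 0."""
    return unicodedata.combining(component[0]) == 0


def starts_ok_by_category(component: str) -> bool:
    """Test on general category: first character must not be a Mark."""
    return not unicodedata.category(component[0]).startswith("M")


for cp in (0x0302, 0x20DD):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch),
          unicodedata.category(ch), unicodedata.combining(ch))

print(starts_ok_by_class("\u20ddA"), starts_ok_by_category("\u20ddA"))
```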

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive@davros.org>  | Fax:  +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            | NOTE: fax number change

