Re: A simpler proposal: PaceAtomIDAsString
Paul Hoffman / IMC wrote:
Greetings again. The earlier thread on atom:id made it clear that people
really, really want to compare things consistently, and that using URIs
might get in the way of that. Thus, I have made a simplifying proposal
at <http://intertwingly.net/wiki/pie/PaceAtomIDAsString>.
It explicitly allows URIs for people who want URIs, and allows plain
text for the rest of us.
According to the pace, it's not plain text for the rest of us, it's
Unicode for the rest of us. If the comparison rules are to be
character based over Unicode, that would suggest the Unicode
encoding must also be specified so we know how many bytes are being
used per character and so on. A 'string of Unicode characters' isn't
sufficient by itself. Perhaps the Pace is using "character" in a
non-Unicode sense (if so, that's confusing). If the intention is
really to compare on Unicode characters/codepoints independent of
the encoding scheme (i.e., along the lines of some kind of Unicode
'Infoset'), that needs to be said - it's certainly not plain text
anymore.
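To make the distinction concrete, here is a minimal sketch in Python. The
identifier value is made up for illustration and is not taken from the Pace.
It shows that the same string has different bytes under UTF-8 and UTF-16,
while a comparison on decoded code points is independent of the encoding.

```python
# Illustrative sketch only: the identifier below is hypothetical.
ident = "tag:example.org,2004:caf\u00e9"

# Serialize the same identifier in two different Unicode encodings.
utf8_bytes = ident.encode("utf-8")
utf16_bytes = ident.encode("utf-16-le")

# A byte-by-byte comparison depends on the encoding that was chosen.
bytes_differ = utf8_bytes != utf16_bytes

# A comparison on code points does not: decode first, then compare.
code_points_utf8 = [ord(c) for c in utf8_bytes.decode("utf-8")]
code_points_utf16 = [ord(c) for c in utf16_bytes.decode("utf-16-le")]
same_code_points = code_points_utf8 == code_points_utf16
```

If comparison over code points is what the Pace intends, the encoding of the
document carrying the identifier should not matter to the comparison.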
But:
"Even if a particular atom:id instance looks like a URI, it SHOULD
NOT be treated as one."
doesn't seem consistent with the idea that we can have URIs if we
want. The pace is saying (to me) something along these lines "if it
looks like an HTTP URL you shouldn't run GET on it; if it looks like
a NewsML URI you shouldn't infer from the versioning bits". We have
experience that suggests people will not be able to oblige
themselves to this kind of constraint. IMO it should be struck.
Generally I'm concerned that we will use URI scheme structures to
generate unique keys (they're handy for that) and thus end up with
Unicode strings that look like URIs but are not supposed to be
treated as URIs, even where those URIs have comparison and
normalization rules. I believe this is maximally surprising.
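A small sketch of the surprise, in Python. The two identifiers and the helper
same_after_partial_normalization are hypothetical, and the helper only
lowercases the scheme and host; it is a deliberately partial stand-in for URI
normalization, not a full implementation of any URI specification.

```python
from urllib.parse import urlsplit

# Two hypothetical identifiers that differ only in the case of the host.
id_a = "http://Example.org/feed/1"
id_b = "http://example.org/feed/1"

# Compared character by character, as the Pace requires, they are different.
equal_as_strings = id_a == id_b


def same_after_partial_normalization(x: str, y: str) -> bool:
    """Compare two URI-like strings after lowercasing scheme and host only."""
    sx, sy = urlsplit(x), urlsplit(y)
    # urlsplit's .hostname attribute is already lowercased.
    key_x = (sx.scheme.lower(), sx.hostname, sx.path, sx.query)
    key_y = (sy.scheme.lower(), sy.hostname, sy.path, sy.query)
    return key_x == key_y


# Treated as URIs, someone would reasonably consider them the same.
equal_as_uris = same_after_partial_normalization(id_a, id_b)
```

Here equal_as_strings is False while equal_as_uris is True, which is the kind
of mismatch I would expect to trip people up.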
May I suggest this wording for the time being:
" The "atom:id" element's content conveys a permanent,
globally unique identifier for the feed. It MUST NOT change over
time, even if the feed is relocated. An atom:head element MAY
contain an atom:id element, but MUST NOT contain more than one. The
content of this element, when present, MUST be a string of Unicode
characters encoded as UTF-8. When atom:id elements are compared,
they MUST be compared on a character-by-character basis.
It is not a goal that atom:id be usable for retrieval of
information.
Historically, in syndication feeds, the detection of
duplicates has been error-prone because of failure to assign
identifiers which are globally unique and stable. Identifiers have
been observed to change when a feed moved hosts or when an entry was
reassigned to a different category or its title was edited. The
management of globally unique and immutable identifiers requires
prior planning and extra effort, but this is more than justified by
the benefits of robust duplicate detection. "
Of course this raises the question - what if the Atom is encoded as
UTF-16?
In all seriousness, perhaps I just don't understand the pace.
cheers
Bill