
Re: A simpler proposal: PaceAtomIDAsString




Paul Hoffman / IMC wrote:



Greetings again. The earlier thread on atom:id made it clear that people really, really want to compare things consistently, and that using URIs might get in the way of that. Thus, I have made a simplifying proposal at <http://intertwingly.net/wiki/pie/PaceAtomIDAsString>.

It explicitly allows URIs for people who want URIs, and allows plain text for the rest of us.

According to the Pace, it's not plain text for the rest of us, it's Unicode for the rest of us. If the comparison rules are to be character-based over Unicode, that would suggest the Unicode encoding must also be specified, so we know how many bytes are being used per character and so on. A 'string of Unicode characters' isn't sufficient by itself. Perhaps the Pace is using "character" in a non-Unicode sense (if so, that's confusing). If the intention is really to compare Unicode characters/code points independent of the encoding scheme (i.e. along the lines of some kind of Unicode 'Infoset'), that needs to be said; it's certainly not plain text anymore.
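To make the bytes-versus-code-points distinction concrete, here is a minimal Python sketch. The identifier and the helper name `same_id` are made up for illustration; nothing here comes from the Pace itself. It shows the same string serialized two ways: the bytes differ, but a code point by code point comparison of the decoded strings agrees.

```python
# Hypothetical identifier; the non-ASCII character makes the byte
# difference between encodings visible.
atom_id = "tag:example.org,2004:caf\u00e9"

# The same sequence of code points, serialized in two encodings.
utf8_bytes = atom_id.encode("utf-8")
utf16_bytes = atom_id.encode("utf-16-be")


def same_id(a: str, b: str) -> bool:
    """Compare two identifiers code point by code point (no normalization)."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


# Byte-level comparison depends on the encoding chosen...
bytes_differ = utf8_bytes != utf16_bytes

# ...while comparison after decoding does not.
decoded_match = same_id(
    utf8_bytes.decode("utf-8"),
    utf16_bytes.decode("utf-16-be"),
)
```

If the Pace means the second kind of comparison, the encoding of the document would not matter, but the text would have to say so.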


But:

"Even if a particular atom:id instance looks like a URI, it SHOULD NOT be treated as one."

doesn't seem consistent with the idea that we can have URIs if we want. The Pace is saying (to me) something along these lines: "if it looks like an HTTP URL you shouldn't run GET on it; if it looks like a NewsML URI you shouldn't infer anything from the versioning bits". Experience suggests people will not be able to hold themselves to this kind of constraint. IMO it should be struck.

Generally I'm concerned that we will use URI scheme structures to generate unique keys (they're handy for that) and thus end up with Unicode strings that look like URIs but are not supposed to be treated as URIs, even where those URIs have comparison and normalization rules. I believe this is maximally surprising.
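A small Python sketch of the surprise, under stated assumptions: the two identifiers are invented, and `uri_normalize` is a hypothetical, deliberately partial helper that lowercases only the scheme and the host (it assumes no userinfo in the authority and ignores every other URI normalization step).

```python
from urllib.parse import urlsplit, urlunsplit

# Two hypothetical ids that differ only in the case of the host name.
id_a = "http://Example.org/2004/entry/1"
id_b = "http://example.org/2004/entry/1"


def uri_normalize(uri: str) -> str:
    """Partial normalization for illustration: lowercase scheme and host only."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # assumes the authority carries no userinfo
        parts.path,
        parts.query,
        parts.fragment,
    ))


# Character-by-character comparison: these are two different ids.
string_equal = id_a == id_b

# URI-style comparison: these name the same resource.
uri_equal = uri_normalize(id_a) == uri_normalize(id_b)
```

Software that follows the Pace sees two distinct entries here; software that notices the ids look like URIs and compares them as such sees one.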

May I suggest this wording for the time being:


" The "atom:id" element's content conveys a permanent, globally unique identifier for the feed. It MUST NOT change over time, even if the feed is relocated. An atom:head element MAY contain an atom:id element, but MUST NOT contain more than one. The content of this element, when present, MUST be a string of Unicode characters encoded as UTF-8. When atom:id elements are compared, they MUST be compared on a character-by-character basis.


It is not a goal that atom:id be usable for retrieval of information.

Historically, in syndication feeds, the detection of duplicates has been error-prone because of failure to assign identifiers which are globally unique and stable. Identifiers have been observed to change when a feed moved hosts or when an entry was reassigned to a different category or its title was edited. The management of globally unique and immutable identifiers requires prior planning and extra effort, but this is more than justified by the benefits of robust duplicate detection. "

Of course this raises the question: what if the Atom document is encoded as UTF-16?

In all seriousness, perhaps I just don't understand the pace.

cheers
Bill