
Letter from Planet Web on identifiers

I disagree with the direction Atom's going on links and identifiers. This note is

(a) to explain why I think this notion of atom:id for atom:entry is
misguided and unhelpful, and
(b) to acknowledge that it's probably not actively harmful and say
"let's move on", and
(c) to ask for a specific wording change in the description of link and id.

In this discussion, I should start by acknowledging my bias: I'm a Web guy. The Web has been at the centre of my professional life for ten years now, and I think that people who are designing protocols for global-scale information systems should try to understand the Web's lessons. You don't have to make all the same design choices, but you really should understand why it works the way it does so you can understand the costs and benefits of differing choices.

1. How the Web Actually Works

On Feb 28, 2004, at 12:14 AM, Roger B. wrote:

>> One of those reasons, by the way,
>> is the trifling fact that substantially all of the existing Web
>> software that actually works - browsers, caches, servers, spiders, all
>> that boring stuff - is currently built around that assumption.
>
> that's definitely
> an assumption that flies in the face of the dynamic Web. :) Query-strings,
> cookies, and sessions effectively nuke any notion that an http: URI alone
> can accurately identify a resource.

I don't have the time to do code walk-throughs on the crawlers and indexers and caches and middleware that make the Web actually useful; but the people who wrote them universally believe that Web identity is what goes with a URI and nothing else. For some gnarly details on how real software tries to guess whether or not two URIs are actually the same, see http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html#comparison. [Oh, with respect to Roger's note: query-strings are a string of characters that make up part of a URI and are opaque to all the middleware; cookies do muddy the waters but have the virtue that middleware can safely ignore them; and the Web doesn't do sessions, it uses stateless protocols].
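
To make the "are these two URIs the same?" guessing concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the algorithm from the draft linked above: the names `normalize_uri` and `same_resource` are hypothetical, only scheme case, host case, default ports, and an empty path are normalized, userinfo and fragments are dropped, and IPv6 hosts are not handled. Note that the query string is compared as an opaque run of characters, which is the point made in the bracketed aside.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the two schemes this sketch knows about (an assumption;
# real middleware handles more schemes and more normalization steps).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_uri(uri: str) -> str:
    """Return a conservatively normalized form of an http(s) URI."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops userinfo and port
    netloc = host
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"  # treat an empty path as "/"
    # The query string is passed through untouched: it is opaque here.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def same_resource(a: str, b: str) -> bool:
    """Guess whether two URIs identify the same resource."""
    return normalize_uri(a) == normalize_uri(b)
```

With this sketch, `HTTP://Example.COM:80` and `http://example.com/` compare equal, while two URIs that differ only in the order of their query parameters do not, because nothing in the middle is entitled to look inside the query string.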

And, by the way, ever since about fifteen minutes after the dawn of the Web, voices have been raised in horror at this conflation of naming and addressing, and saying There Must Be A Better Way; so this is not a new issue. More on that below.

2. Why I Dislike atom:id

I asked on a couple of occasions what the real applications for atom:id were, and here are the answers I got:

(a) I move my weblog somewhere else; with atom:id, the first time someone's
aggregator visits the new site they don't see old postings as new.
(b) My software decides to change the way it assigns URIs to items; with atom:id,
the first time someone's aggregator visits the changed site they don't see old
postings as new.
(c) I publish the same article in two different locations, and decide to create
two copies with different URIs rather than just point twice to the same URI.
With atom:id, the aggregator doesn't see this twice.
(d) Something gets passed through several levels of syndication, e.g. a
Reuters story, and various users decide to make their own URIs for it; with
atom:id, the aggregator sees each once.
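
All four cases rest on the same mechanism: the aggregator remembers a key per entry and suppresses entries whose key it has already seen. The following Python sketch shows the difference between keying on the link and keying on atom:id. It is a hedged illustration only; the `Entry` and `Aggregator` classes, the `use_id` flag, and the example URIs are hypothetical and do not describe any real aggregator.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Entry:
    link: str                       # the entry's URI (atom:link)
    entry_id: Optional[str] = None  # the entry's atom:id, if any


class Aggregator:
    """Toy aggregator that reports only entries it has not seen before."""

    def __init__(self, use_id: bool) -> None:
        self.use_id = use_id
        self.seen: Set[str] = set()

    def unseen(self, entries: List[Entry]) -> List[Entry]:
        fresh = []
        for entry in entries:
            # Key on atom:id when enabled and present, else on the link.
            if self.use_id and entry.entry_id:
                key = entry.entry_id
            else:
                key = entry.link
            if key not in self.seen:
                self.seen.add(key)
                fresh.append(entry)
        return fresh


# Case (a): the weblog moves, so the link changes but the id does not.
old = Entry("http://old.example.org/post/1", "tag:example.org,2004:post-1")
moved = Entry("http://new.example.org/post/1", "tag:example.org,2004:post-1")
```

An aggregator built with `use_id=False` reports `moved` as new after having seen `old`; one built with `use_id=True` does not. That one-time suppression is the whole benefit being claimed.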

In my opinion, each of (a), (b), and (c) is a bad, poorly-thought-out, Web-hostile practice. They decrease the usefulness of search engines and caches and bookmark facilities, i.e. of the Web as a whole. Furthermore, in each case the benefit (a one-time avoidance of the sight of duplicate links) seems pretty minor. I'm still baffled when people say "I need to have multiple URIs so I can publish in multiple categories". Er... pointers? Computer programmers have been doing call-by-reference for some decades now, so I don't get it.

By the way, I too see dupes in my RSS feed, and in virtually every case I can trace them to incompetence or stupidity in the CMS or production system upstream, and I have my doubts that people who already can't manage to detect dupes are going to be helped by an atom:id in the interface protocol.

Furthermore, I'm nervous about these use-cases where the author seems to think the item in two different places is the same item. For example, I asked Norm Walsh why he wanted to do this and he wrote:

Because I want to provide different CSS or different navigation links
depending on which site you read them from. In other words, I want the
essays to "fit in" with the context in which they are presented and I
want to present them in two different contexts.

So Norm, you're changing the styles (which BTW can suppress whole <div>s) and the links (a defining part of web content) and the context and you're really sure I should still think these are the same thing? Think you could leave that choice up to me?

The strongest remaining use-case I see is the Reuters-article-in-two-places case, and it still doesn't seem very strong to me. To start with, for general news I'd rather go through something like Topix or Google that is going to take care of that anyhow. If I actually did subscribe to a bunch of general-purpose newsfeeds I'd probably be interested in seeing who picks up which stories (The same article appearing in the New York Times and the National Review is just not the same thing unless you're politically illiterate). Hold on, I *do* subscribe to a bunch of general-purpose newsfeeds and I just don't have this problem. I *do* have the problem with idiotic publishers republishing the same content time after time under different URIs rather than just bumping the modification date. So I'm unconvinced that (a) there's actually a problem here that requires inventing a whole new level of identity semantics and (b) if there were, atom:id would help.

So, from where I sit, atom:id is going to provide a very moderate benefit while encouraging Web-hostile bad publishing practices. So I don't think it should be in Atom.

3. Why It's OK

As I said, ever since the URL/URI notion came along, lots of people have objected (as we've seen in this list), saying "mixing up naming and addressing is just wrong." I have a lot of sympathy with this point of view. Unfortunately it leads nowhere. Phil Karlton, one of the smartest programmers I ever met, said "there are only two hard problems in Computer Science: naming and cache invalidation." URIs are the first global-scale data naming system that has ever worked in the slightest. It isn't perfect but it kind of limps along.

So what people do about this is they think "we're smarter than those Web morons, we'll do identifiers right!" And they go off and found an IETF working group (lots of those corpses litter the landscape) or they invent tag: or URNs or doi: or atom:id or whatever, under the assumption that naming is a technical problem.

History shows that doing names properly is a management problem, not a technical problem. An organization that is competently managed will make the effort to ensure that its identifiers are persistent, stable, and available. An organization that is stupidly managed will screw this up even if it's using URNs.

And after all these years, the only URL alternative that's getting reasonably widespread deployment (that I know of) is the WebDAV URI schemes, which are aimed at solving a different class of problems. Last time I checked, my well-equipped computer here has no software that can do anything useful with URNs or tag: or doi: or any of the others.

So, it's OK to have atom:id. Organizations that publish feeds, if they're competent, will eventually realize that their stuff will be more accessible and more popular and more influential if they don't fuck with their URIs and don't get in the way of Google and Akamai and all the other middleware doing their work properly. So they'll just let the URI be the identifier and the identifier be the URI and get on with life.

4. Request

I'll do this formally on the Wiki and post here again. But I want to change the wording of the description of link and id as follows:

4.13.2: The "atom:link" element is the URI for this entry, seen as a Web Resource. An entry must have one and exactly one atom:link.

[End of story. I totally want to get rid of the "alternate" stuff, which is a "wouldn't this be nice" feature].

4.13.2: The "atom:id" element is an assertion of globally unique identity. That is to say, if two different entries (in the same feed or different feeds) have the same atom:id, this constitutes an assertion that the two entries are the same.

[Recognizing that "identity" is often a matter of opinion and context. And it worries me that the atom:id isn't attached to some provider so I can ask "who said so?"].
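
One way to read the proposed wording, sketched in Python: treat a shared atom:id as a claim that entries are the same, and keep track of which feed made each claim so that "who said so?" has an answer. This is only an illustration; the function name `group_by_asserted_identity`, the input shape, and the example URIs are assumptions and are not part of the proposed spec text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_asserted_identity(
    feeds: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group entries by atom:id, remembering which feed asserted each one.

    `feeds` maps a feed URI to a list of (atom_id, link) pairs.
    The result maps each atom:id to a list of (feed_uri, link) pairs.
    """
    claims = defaultdict(list)
    for feed_uri, entries in feeds.items():
        for atom_id, link in entries:
            claims[atom_id].append((feed_uri, link))
    return dict(claims)


# Two hypothetical feeds asserting that their entries are the same thing.
feeds = {
    "http://a.example/feed": [("urn:example:story-1", "http://a.example/1")],
    "http://b.example/feed": [("urn:example:story-1", "http://b.example/story")],
}
claims = group_by_asserted_identity(feeds)
```

Here `claims["urn:example:story-1"]` lists both feeds and both links. Whether to believe the assertion is left to the consumer, since the atom:id itself carries no provider.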

