
Re: PaceFeedEquivalence




On Jul 29, 2004, at 8:27 PM, Joe Gregorio wrote:


http://www.example.com:80/feed.atom
http://www.Example.COM/feed.atom

I guarantee that some apps, in particular web spiders, are going to get
very aggressive about scheme-specific normalization and regard these as
equivalent, and I don't see any reason to rule this out. Norm? -Tim

First, "Simple String Comparison" is the strictest form of URI comparison that you can do. That is, as a means of determining if two atom:ids are the same, it is the strictest test possible. Your example, and all other ways of comparing URIs, will be looser than "Simple String Comparison". And that means that no matter how you compare the URIs in practice, you will always find the ones that are the same. I believe that would be a plus for interoperability.

Please try again, Joe. I must be having a stupid evening; I read that paragraph three times and I can't figure out what you're trying to say.


That is, if I am a producer and I produce two feeds
with the same entry in each but they
have atom:ids of

   http://www.example.com:80/feed.atom
and
   http://www.Example.COM/feed.atom

respectively then as a producer I should not be surprised
if an aggregator treats them as different entries.

Right, and as a producer that would be bad practice. The interesting (and not uncommon in my experience) case is when you pick those up on two different servers and are wondering if they're really the same. We're not going to *force* people to be heroic about doing scheme-specific comparison, but there's no point writing rules against it.


Second, the assertion about spiders is a bit
odd, are you asserting there are spiders out
there that know nothing of the Atom format
yet will try to determine the equivalence of
'entries' based on atom:id? Or are you asserting
there are spiders that will know of the Atom format
and will knowingly ignore the specification?

I am 100% certain that there will be atom-savvy spiders, based on the fact that there are already RSS-savvy spiders. Large scale spiders (I've written two) go to immense lengths to spot duplicate URLs because there are a lot of them out there, and you can do immense amounts of computation in the time it takes to fetch one, so it's very cost-effective. I guarantee that every serious search-engine spider on the planet is already doing this kind of HTTP-specific smart comparison. There's no earthly reason to write a rule against it. It would also be unreasonable to *require* anything smarter than string comparison, but as URIs get passed around the infrastructure, shit happens, and we shouldn't get in the way of cleanup attempts. -Tim
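
To make the two kinds of comparison in this thread concrete, here is a minimal Python sketch. It is not taken from any spec draft or from any real spider; the function names and the `DEFAULT_PORTS` table are made up for illustration, and the normalization covers only case-folding the scheme and host, dropping a default port, and treating an empty path as "/". Userinfo and IPv6 literals are deliberately ignored.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these two schemes get scheme-specific treatment here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_http_uri(uri: str) -> str:
    """Apply a small, hypothetical subset of HTTP-specific normalization."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # SplitResult.hostname is already lower-cased; userinfo is dropped.
    host = parts.hostname or ""
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    else:
        netloc = host
    # An empty path and "/" name the same resource for http(s).
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def same_by_string(a: str, b: str) -> bool:
    """Simple string comparison: the strictest test."""
    return a == b


def same_by_normalization(a: str, b: str) -> bool:
    """Looser test: compare after scheme-specific normalization."""
    return normalize_http_uri(a) == normalize_http_uri(b)


a = "http://www.example.com:80/feed.atom"
b = "http://www.Example.COM/feed.atom"
print(same_by_string(a, b))         # False
print(same_by_normalization(a, b))  # True
```

Any pair that passes `same_by_string` also passes `same_by_normalization`, since identical inputs normalize identically. The reverse does not hold, which is the sense in which string comparison is the strictest test.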