
Re: PaceFeedEquivalence



On Thu, 29 Jul 2004 21:10:59 -0700, Tim Bray <tim.bray@xxxxxxx> wrote:
> On Jul 29, 2004, at 8:27 PM, Joe Gregorio wrote:
> 
> >> http://www.example.com:80/feed.atom
> >> http://www.Example.COM/feed.atom
> >>
> >> I guarantee that some apps, in particular web spiders, are going to get
> >> very aggressive about scheme-specific normalization and regard these as
> >> equivalent, and I don't see any reason to rule this out.  Norm?  -Tim
> >
> > First, "Simple String Comparison" is the strictest form
> > of URI comparison that you can do. That is,
> > as a means of determining if two atom:ids are the same
> > it is the strictest test possible. Your example,
> > and all other ways of comparing URIs, will be looser than
> > "Simple String Comparison". And that means that no
> > matter how you compare the URIs in practice, you will
> > always find the ones that are the same. I believe that
> > would be a plus for interoperability.
> 
> Please try again, Joe.  I must be having a stupid evening; I read that
> paragraph three times and I can't figure out what you're trying to say.
> 

> I am 100% certain that there will be atom-savvy spiders, based on the
> fact that there are already RSS-savvy spiders.  Large scale spiders
> (I've written two) go to immense lengths to spot duplicate URLs because
> there are a lot of them out there, and you can do immense amounts of
> computation in the time it takes to fetch one, so it's very
> cost-effective.  I guarantee that every serious search-engine spider on
> the planet is already doing this kind of HTTP-specific smart
> comparison.  There's no earthly reason to write a rule against it.  It
> would also be unreasonable to *require* anything smarter than string
> comparison, but as URIs get passed around the infrastructure, shit
> happens, and we shouldn't get in the way of cleanup attempts. -Tim
> 

I'll answer both of the above with a question: Which
URI normalizations should be used when comparing
atom:id URIs for equivalence?

"Case Normalization"
"Percent-Encoding Normalization"
"Path Segment Normalization"
"Scheme-based Normalization"
"Protocol-based Normalization"
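
To make the choices above concrete, here is a rough sketch of what the
first four normalizations might look like for http/https URIs. The
function names, the default-port table and the restriction to absolute
URIs are my own assumptions for illustration, not anything from the
draft. Protocol-based normalization is left out because it requires
dereferencing the URI (for example, to observe redirects).

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: only http and https get scheme-based treatment here.
DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Percent-encoding normalization for a single %XX triplet."""
    char = chr(int(match.group(1), 16))
    # Decode escapes of unreserved characters; uppercase the hex of the rest.
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def remove_dot_segments(path):
    """Path segment normalization for an absolute path."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == "..":
            if len(out) > 1:      # never pop the leading empty segment
                out.pop()
            if last:
                out.append("")    # keep the trailing slash
        elif seg == ".":
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def normalize(uri):
    """Hypothetical helper: return a normalized form of an absolute URI."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()            # case normalization
    host = (parts.hostname or "").lower()    # case normalization
    if ":" in host:                          # restore IPv6 brackets
        host = "[" + host + "]"
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + host
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + str(port)            # drop only the default port
    path = remove_dot_segments(ESCAPE.sub(_fix_escape, parts.path))
    if scheme in DEFAULT_PORTS and not path:
        path = "/"                           # scheme-based: empty path is "/"
    query = ESCAPE.sub(_fix_escape, parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))


a = "http://www.example.com:80/feed.atom"
b = "http://www.Example.COM/feed.atom"
print(a == b)                        # simple string comparison
print(normalize(a) == normalize(b))  # comparison after normalization
```

Tim's two example URIs differ under simple string comparison but come
out identical from normalize(), which is the gap the question above is
about.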

That is for the *consumer* side of atom:id. On the producer
side, should we suggest or require the canonical form of
URIs[1]? And if so, should we augment that canonicalization
to include a Unicode normalization form?
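
As an illustration of why a Unicode normalization form might matter on
the producer side, here is a small sketch. The choice of NFC and the
example identifiers are assumptions made only for the example; which
form (if any) to pick is exactly the open question.

```python
import unicodedata

# Two spellings of the same visible identifier (hypothetical values).
composed = "http://example.org/caf\u00e9"     # e-acute as one code point
decomposed = "http://example.org/cafe\u0301"  # "e" + combining acute accent


def to_nfc(identifier):
    """Apply Unicode Normalization Form C (assumed form, see above)."""
    return unicodedata.normalize("NFC", identifier)


print(composed == decomposed)                  # simple string comparison
print(to_nfc(composed) == to_nfc(decomposed))  # comparison after NFC
```

Without a producer-side rule, the two strings fail simple string
comparison even though they render identically.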

    -joe

[1] http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html#canonical-form

-- 
Joe Gregorio        http://bitworking.org