
Re: Atom Link Extensions Use Case



Why not just drop an element into the <entry> in your own namespace?  This doesn’t feel like any kind of a link to me.

<feed xmlns:loc="http://whatever.loc.gov">
  ...
  <entry>
    ...
    <loc:checksum>3c89ea593c01483fd091</loc:checksum>
    ...
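
A client could pick up an extension element like this with any namespace-aware parser. Here is a minimal sketch in Python, not a definitive implementation. It assumes the placeholder namespace URI from the snippet above, and it completes the elided feed with the Atom default namespace, a title, and closing tags purely for illustration. The function name entry_checksums is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
LOC = "http://whatever.loc.gov"  # placeholder namespace URI from the snippet above

# Illustrative feed: the original snippet elides everything except loc:checksum,
# so the title and the Atom default namespace here are assumptions.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:loc="http://whatever.loc.gov">
  <entry>
    <title>part-00001.tar.bz2</title>
    <loc:checksum>3c89ea593c01483fd091</loc:checksum>
  </entry>
</feed>"""


def entry_checksums(feed_xml):
    """Map each entry title to the text of its loc:checksum element, if present."""
    root = ET.fromstring(feed_xml)
    result = {}
    for entry in root.findall(f"{{{ATOM}}}entry"):
        title = entry.findtext(f"{{{ATOM}}}title")
        checksum = entry.findtext(f"{{{LOC}}}checksum")
        if checksum is not None:
            result[title] = checksum.strip()
    return result


if __name__ == "__main__":
    print(entry_checksums(SAMPLE))
```

Processors that do not recognize the loc namespace should simply skip the element, which is why a foreign-namespace child is a low-risk way to carry the value.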

On Fri, Jun 8, 2012 at 6:04 AM, Ed Summers <ehs@xxxxxxxxx> wrote:

Hi all,

I am using Atom to syndicate access to data dumps at the Library of
Congress. We have a web application that provides access to historic
newspapers [1], and we have received requests for access to the
underlying OCR data for research and commercial purposes. Despite the
fact that this is historic data, we are routinely adding new content
as it is digitized. Rather than require clients to issue millions of
requests to get at the OCR data (which is actually web addressable)
the plan is to periodically create a tarred and compressed dump file
of new OCR content, and publish the availability of the file in an
Atom feed, which interested parties can subscribe to. It's a similar
model to what Wikimedia does for various Wikipedia projects [2].

Here's a minimal example, to give you an idea of what I mean (warning:
the URLs don't currently resolve):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
   <title>Chronicling America OCR Dumps</title>
   <link rel="self" type="application/atom+xml"
href="http://chroniclingamerica.loc.gov/dumps/ocr/feed/" />
   <id>info:lc/ndnp/dumps/ocr</id>
   <author>
       <name>Library of Congress</name>
       <uri>http://loc.gov</uri>
   </author>
   <updated>2012-06-08T08:35:27-04:00</updated>
   <entry>
       <title>part-00001.tar.bz2</title>
       <link rel="alternate" type="application/x-bzip2"
href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2"
/>
       <id>info:lc/ndnp/dump/ocr/part-00001.tar.bz2</id>
       <updated>2012-06-07T13:57:23-04:00</updated>
       <summary type="xhtml"><div
xmlns="http://www.w3.org/1999/xhtml">OCR dump file <a
href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2">part-00001.tar.bz2</a>
with size 162.7 MB generated June 7, 2012, 1:57 p.m.</div></summary>
   </entry>
</feed>

So the reason why I am writing here is that I would like to add
checksum information to the feed to let clients verify that they have
downloaded the data dump file correctly. An argument could be made
that it's not necessary since a corrupted bz2 file would likely not
decompress. An argument could also be made that the Content-MD5 header
could be used. But I like the idea of making an explicit assertion
about the checksum in the Atom document.

After a bit of googling I ran across James Snell's Atom Link
Extensions draft [3], which provides a pattern for including an md5
checksum in the <link> element like so:

   <link rel="alternate" type="application/x-bzip2"
hash="md5:579758192095fde80896058af4ce0aee"
href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2"
/>
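
On the client side, checking a downloaded dump against such an attribute could look roughly like the sketch below. It assumes the hash value is a single "algorithm:hexdigest" pair, as in the example above; the function names parse_hash_attr and file_matches are hypothetical and not part of the draft.

```python
import hashlib


def parse_hash_attr(value):
    """Split a value such as 'md5:5797...' into ('md5', '5797...')."""
    algorithm, _, digest = value.partition(":")
    return algorithm.strip().lower(), digest.strip().lower()


def file_matches(path, hash_attr, chunk_size=1 << 20):
    """Return True if the file at path has the digest named in hash_attr.

    The file is read in chunks so that a multi-hundred-megabyte dump
    does not have to fit in memory.
    """
    algorithm, expected = parse_hash_attr(hash_attr)
    hasher = hashlib.new(algorithm)  # raises ValueError for an unknown algorithm
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected
```

A subscriber would call file_matches("part-00001.tar.bz2", link.get("hash")) after the download finishes and fetch the file again if it returns False.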

Unfortunately it looks like the draft has expired. I was wondering:

a) are there other established patterns for adding checksum
information for resources in Atom?
b) is it worth it for James to update the draft and try to push it
forward to Informational status?

As more and more data providers make dumps of their data available to
reduce crawling (like Wikipedia) it seems like a good use case for
Atom to support.

//Ed

[1] http://chroniclingamerica.loc.gov
[2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml-rss.xml
[3] http://tools.ietf.org/html/draft-snell-atompub-link-extensions-08