
Re: Link Extensions. Need "md5" or some kind of hash.



FYI: The new RFC5854 re: "Metalink Download Description Format" defines a "hash" element in its section 4.2.4. They have chosen to use the method of storing the hash algorithm in a different element than the hash value. That is, in my opinion, unfortunate. I do not suggest that we follow their lead. I'm just posting this here so that folk can be aware of what others are doing in other contexts.

The relevant text from RFC5854 follows:
4.2.4.  The "metalink:hash" Element

   The "metalink:hash" element is a Text construct that conveys a
   cryptographic hash for a file.  All hashes are encoded in lowercase
   hexadecimal format.  Hashes are used to verify the integrity of a
   complete file or portion of a file to determine if the file has been
   transferred without any errors.

   metalinkHash =
      element metalink:hash {
        attribute type { text }?,
        text
      }

   Metalink Documents MAY contain one or multiples hashes of a complete
   file. metalink:hash elements with a "type" attribute MUST contain a
   hash of the complete file.  In this example, both SHA-1 and SHA-256
   hashes of the complete file are included.

 ...
   <hash type="sha-1">a97fcf6ba9358f8a6f62beee4421863d3e52b080</hash>
   <hash type="sha-256">fc87941af7fd7f03e53b34af393f4c14923d74...</hash>
 ...

   Metalink Documents MAY also contain hashes for individual pieces of a
   file. metalink:hash elements that are inside a metalink:pieces
   container element have a hash for that specific piece or chunk of the
   file, and are of the same hash type as the metalink:pieces element in
   which they are contained.  Metalink Documents MAY contain one or
   multiple metalink:pieces container elements, if each "type" attribute
   of metalink:pieces has a unique value.

   metalink:hash elements without a "type" attribute MUST contain a hash
   for that specific piece or chunk of the file and MUST be listed in
   the same order as the corresponding pieces appear in the file,
   starting at the beginning of the file.  The size of the piece is
   equal to the value of the "length" attribute of the metalink:pieces
   element, apart from the last piece, which is the remainder.  See
   Section 4.1.3.2 for more information on the size of pieces.
   In this example, SHA-1 and SHA-256 hashes of the complete file are
   included, along with four SHA-1 piece hashes.

 ...
   <hash type="sha-1">a97fcf6ba9358f8a6f62beee4421863d3e52b080</hash>
   <hash type="sha-256">fc87941af7fd7f03e53b34af393f4c14923d74...</hash>
   <pieces length="1048576" type="sha-1">
     <hash>d96b9a4b92a899c2099b7b31bddb5ca423bb9b30</hash>
     <hash>10d68f4b1119014c123da2a0a6baf5c8a6d5ba1e</hash>
     <hash>3e84219096435c34e092b17b70a011771c52d87a</hash>
     <hash>67183e4c3ab892d3ebe8326b7d79eb62d077f487</hash>
   </pieces>
 ...
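
For anyone who wants to see how the piece hashes above would be checked, here is a minimal sketch in Python. It assumes the un-namespaced fragment shown in the RFC example (a real Metalink document puts these elements in the Metalink namespace), and the function names piece_hashes and verify_pieces are hypothetical, not taken from RFC5854 or any library.

```python
import hashlib
import xml.etree.ElementTree as ET


def piece_hashes(data: bytes, length: int, algo: str = "sha1") -> list[str]:
    """Hash consecutive `length`-byte pieces; the last piece is the remainder."""
    return [
        hashlib.new(algo, data[i:i + length]).hexdigest()  # lowercase hex
        for i in range(0, len(data), length)
    ]


def verify_pieces(pieces_xml: str, data: bytes) -> bool:
    """Compare the <hash> children of a <pieces> element, in order, with data."""
    pieces = ET.fromstring(pieces_xml)
    length = int(pieces.get("length"))
    # Assumption: registry-style names ("sha-1") map to hashlib names ("sha1")
    # by dropping the hyphen. That holds for sha-1 and sha-256, not in general.
    algo = pieces.get("type").replace("-", "")
    expected = [h.text.strip().lower() for h in pieces.findall("hash")]
    return expected == piece_hashes(data, length, algo)
```

Because the piece hashes carry no "type" of their own, the checker has to read the algorithm from the enclosing pieces element, which is the separation of algorithm and value discussed in this thread.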


On Sun, May 16, 2010 at 11:29 PM, James Snell <jasnell@xxxxxxxxx> wrote:
Ok, although I seriously dislike having to do additional parsing on
attribute values, the arguments made so far are valid and parsing hex
encoded hash digests is -- fortunately -- quite simple to do. So let's
go with the following syntax...

 hash = attribute hash { hash-list }
 hash-list = # ( token ":" 1*HEX )

The token and HEX productions are defined by RFC2616...

The spec would defer to the existing IANA registry for hash functions
to define the "tokens"

This would result in a syntax of...

 hash="md5:abc...xyz, sha-1:123...567, sha-512:xyz...abc"

Does this seem acceptable to everyone?

- James
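
To give a feel for how little parsing the proposed hash-list syntax needs, here is a rough sketch in Python. The function name parse_hash_list is hypothetical, the character class is my reading of the RFC2616 "token" production, and lowercasing the digest is an assumption made only to ease comparison. The digests in the usage note are the ones from Sam's example.

```python
import re

# RFC 2616 "token": one or more CHARs, excluding CTLs and separators.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# One list item: token ":" 1*HEX
_ITEM = re.compile(rf"({_TOKEN}):([0-9A-Fa-f]+)")


def parse_hash_list(value: str) -> dict[str, str]:
    """Parse 'algo:hex, algo:hex' into a dict keyed by algorithm token."""
    hashes = {}
    for item in value.split(","):
        item = item.strip(" \t\r\n")
        if not item:
            continue  # the "#" list rule tolerates empty elements
        match = _ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"malformed hash-list item: {item!r}")
        algo, digest = match.groups()
        hashes[algo] = digest.lower()  # assumption: normalise hex case
    return hashes
```

For example, parse_hash_list("md5:6705f99eccedeac20e969bef954c5fb0, sha-1:bc608e6d3d339d1a7afc406a7ea6a8f07358038b") yields a two-entry dict keyed by "md5" and "sha-1". Splitting on "," is safe here because neither token nor HEX can contain a comma.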

On Sat, May 15, 2010 at 11:46 PM, Sam Johnston <samj@xxxxxxxx> wrote:
> James,
> In consideration of former (CRC) and future (AHS) hashing functions I think
> it's critical to support extensibility and multiple hashes. I like that XML
> digsigs use anyURIs to identify hashes (e.g. <DigestMethod
> Algorithm="http://www.w3.org/2000/09/xmldsig#sha1">), but one could argue
> this unnecessarily complicates what should be a simple syntax.
> I was about to propose an IANA registry for hash functions but one already
> exists (Hash Function Textual Names as specified by RFC4572) so it would
> make sense to use it rather than inventing our own mechanism - even if we
> have to update the registry rules to allow for algorithms specified by URI
> rather than RFC.
> While Atom is an XML format and should arguably follow XML conventions,
> there is precedent for prefixing hashes with the name of the hashing
> function using e.g. colons or curly braces. I think it's more important to
> keep the XML syntax simple and in any case the hash and hash function should
> be tightly bound as they are useless independently.
> All that considered, I think the best approach is to allow for a
> multi-valued "hash" attribute ala:
> <link rel="alternate" href="http://example.com/"
> hash="md5:6705f99eccedeac20e969bef954c5fb0
> sha-1:bc608e6d3d339d1a7afc406a7ea6a8f07358038b" />
> and/or
> <link rel="alternate" href="http://example.com/thing.pdf"
> hash="md5:6705f99eccedeac20e969bef954c5fb0"
> hash="sha-1:bc608e6d3d339d1a7afc406a7ea6a8f07358038b" />
> Sam
> Google
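
A quick way to compare the two forms above is to run them through an XML parser. This is only a sketch in Python, with hypothetical helper names (is_well_formed, split_hashes). It relies on the XML rule that an attribute name may appear only once per start tag, so the repeated-attribute form is rejected before any application code sees it.

```python
import xml.etree.ElementTree as ET

# Single multi-valued attribute, whitespace-separated.
single = (
    '<link rel="alternate" href="http://example.com/" '
    'hash="md5:6705f99eccedeac20e969bef954c5fb0 '
    'sha-1:bc608e6d3d339d1a7afc406a7ea6a8f07358038b" />'
)
# The same attribute name repeated on one element.
repeated = (
    '<link rel="alternate" href="http://example.com/thing.pdf" '
    'hash="md5:6705f99eccedeac20e969bef954c5fb0" '
    'hash="sha-1:bc608e6d3d339d1a7afc406a7ea6a8f07358038b" />'
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def split_hashes(link_xml: str) -> dict[str, str]:
    """Split a whitespace-separated 'algo:hex' attribute into a dict."""
    value = ET.fromstring(link_xml).get("hash", "")
    return dict(item.split(":", 1) for item in value.split())
```

Here is_well_formed(single) is True and is_well_formed(repeated) is False, so only the single-attribute form survives a conforming parser.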
> On Sat, May 15, 2010 at 1:15 AM, James Snell <jasnell@xxxxxxxxx> wrote:
>>
>> Good argument Bob... ok... stewing over this a bit more. I generally
>> dislike having to do additional parsing of attribute/element values
>> but there are very good reasons for keeping this as a single "hash"
>> attribute and you make a compelling case.
>>
>> On Fri, May 14, 2010 at 1:26 PM, Bob Wyman <bob@xxxxxxxx> wrote:
>> > James Snell <jasnell@xxxxxxxxx> wrote:
>> >> <link href="" md5="abc...xyz">
>> >>  <media:hash algo="GOST">123...456</media:hash>
>> >> </link>
>> >
>> > The alternative approach, which would support both a variety
>> > and multiplicity of hashes would look like this:
>> > <link href="" hash="gost:123123..., md5:0928402948...,
>> > sha256:098078097..."/>
>> > This strikes me as "simpler" than the hybrid approach. Just a few of my
>> > concerns with the proposed "hybrid" approach follow:
>> >
>> > I like binding the algorithm and value together into a single value
>> > since I know of no compelling case for processing one element in
>> > isolation from the other. The hash value only makes sense if you know
>> > the algorithm, and the algorithm is only useful when bound to a
>> > specific hash value. Thus, it strikes me as simply introducing
>> > syntactic sugar to specify the algorithm using a different XML
>> > component than the value.
>> > These values are likely to be stored in databases and otherwise
>> > manipulated. In all cases, for the data to be meaningful, people will
>> > need to keep the binding between algorithm and hash value. It is
>> > likely that storing a single string value is going to be easier for
>> > folk than dealing with a multi-part value. Also, consider the effect
>> > of parsers... It is likely that in order to transfer a value from an
>> > entry into a database field, what you'll need to do is extract both
>> > algorithm and hash value from the parse tree and then construct some
>> > string that combines them. This would be particularly useful if you
>> > want to use the hash value as a database key (a very reasonable thing
>> > to do...) You could build and store the string
>> > "algo='GOST'>123...456<" or your database might support concatenated
>> > fields, or you could build "gost.123...456". I think I would go with
>> > the latter.
>> > Defining distinct attributes for each hash algorithm pushes
>> > unnecessary syntactical complexity to the global level and thus
>> > increases the complexity not only of the specification but also of
>> > all applications, no matter which algorithms they understand or if
>> > they understand any at all. It also makes extending the list of
>> > supported algorithms "expensive" since such extensions require
>> > modification to the standard rather than just a registry entry. What
>> > benefit do we get from having these algorithm types defined at the
>> > global syntax level?
>> > The hybrid approach looks very complicated to me. It means that I'll
>> > have two very different places in which hash values might be found
>> > and two very different syntaxes for expressing them. The result is
>> > going to be more complex code than would otherwise be the case. What
>> > value comes from using the hybrid approach?
>> > One argument for hybrid is that these elements exist already in other
>> > specs. I wonder if it isn't possible that those other specs might
>> > have approached the problem in a non-optimal fashion. Does it really
>> > make sense to import syntax if there isn't a really good case that
>> > demonstrates that doing so is the best approach?
>> > I am unaware of any hash algorithms that need anything other than the
>> > specification of the algorithm and the value in order to be useful.
>> > If there were broadly used algorithms that had more complex meta-data
>> > requirements, it would be easier to understand the appeal of the
>> > hybrid approach.
>> > I can't think of any reason why it is *useful* to separate the
>> > algorithm from the hash value. Can someone enlighten me here? What
>> > computation, storage or communication task becomes easier if you have
>> > these two separated?
>> >
>> > bob wyman
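
The "gost.123...456" style of database key that Bob describes can be sketched in a few lines of Python. The helper names hash_key and split_key, the table layout, and the example URL pairing are all hypothetical. The sketch also assumes algorithm names contain no "." (true of names like md5 and sha-1, though the token syntax itself would allow one).

```python
import sqlite3


def hash_key(algo: str, digest: str) -> str:
    """Combine algorithm and hex digest into one string usable as a key."""
    return f"{algo.lower()}.{digest.lower()}"


def split_key(key: str) -> tuple[str, str]:
    """Recover (algorithm, digest); assumes the algorithm name has no '.'."""
    algo, _, digest = key.partition(".")
    return algo, digest


# In-memory table keyed by the combined string (illustrative schema only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (hash_key TEXT PRIMARY KEY, href TEXT)")
conn.execute(
    "INSERT INTO entries VALUES (?, ?)",
    (hash_key("md5", "6705f99eccedeac20e969bef954c5fb0"),
     "http://example.com/thing.pdf"),
)
```

With a single "algo:value" attribute, the application gets something very close to this key straight from the parser instead of assembling it from two XML components.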
>> > On Fri, May 14, 2010 at 3:06 PM, James Snell <jasnell@xxxxxxxxx> wrote:
>> >>
>> >> Ok, I've been giving this some more thought and I think a hybrid
>> >> approach works very well. As has been pointed out a number of times in
>> >> this thread, there are existing elements in other namespaces that
>> >> provide an algorithm/hash pairing. I think that the Link Extensions
>> >> Draft can provide attributes for the most basic hash algorithms and
>> >> applications that require hash algorithms that are not covered can
>> >> fall back to the extension elements.
>> >>
>> >> e.g.
>> >>
>> >> <link href="" md5="abc...xyz">
>> >>  <media:hash algo="GOST">123...456</media:hash>
>> >> </link>
>> >>
>> >> This would allow for the most common cases to be easily covered while
>> >> allowing for the full range of possible cases to be handled as well.
>> >>
>> >> - James
>> >>
>> >> On Wed, May 12, 2010 at 8:50 PM, Richard Salz <rsalz@xxxxxxxxxx> wrote:
>> >> >> So the key question is: what are the main algorithms we need to
>> >> >> provide attributes for?
>> >> >
>> >> > This is a hard question to answer -- especially for hash/digest
>> >> > algorithms, which tend to fall more rapidly than vetted crypto
>> >> > algorithms.
>> >> >
>> >> > It's more verbose, but I strongly recommend using a pair of
>> >> > attributes to represent algorithm/value. Use the URIs defined in
>> >> > the latest XML DSIG document, perhaps with the "trick" that
>> >> > relative URIs are a shorthand for the xmldsig namespace.
>> >> >
>> >> >        /r$
>> >> >
>> >> > --
>> >> > STSM, WebSphere Appliance Architect
>> >> > https://www.ibm.com/developerworks/mydeveloperworks/blogs/soma/
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> - James Snell
>> >>  http://www.snellspace.com
>> >>  jasnell@xxxxxxxxx
>> >>