
Change Proposal to HTML WG to fix the algorithm for generating Atom feeds from HTML content


This relates to an HTML WG discussion about the algorithm for creating Atom feeds from HTML (<http://dev.w3.org/html5/spec/Overview.html#atom>).

See <http://www.w3.org/Bugs/Public/show_bug.cgi?id=7806> and <http://www.w3.org/html/wg/tracker/issues/86> for more context on how we got here.

Best regards, Julian

On 06.04.2010 23:12, Julian Reschke wrote:

Below is a change proposal for this issue.

Note that an obvious alternative to fixing the algorithm would be to
remove the section completely.

Best regards,


-- snip --

The HTML5 spec contains an algorithm for producing an Atom (RFC 4287)
feed document from an HTML page.

The definition relaxes a MUST-level requirement from RFC 4287 and, at
the same time, adds a needless restriction.

Also, it's not clear *at all* whether this is a feature that people
really want, and if they do, whether it needs to be part of HTML5. Given
the fact that it's non-trivial to generate a valid Atom feed from HTML,
but the reverse *is* trivial, we should also consider removing this
feature altogether (I'd be happy to write a 2nd change proposal if
people want to see that as well).


Instructions to derive a secondary format from HTML documents shouldn't
be misleading, and should make clear which conditions need to be met to
produce valid documents.


There are two problems, both with the following step (4.15.1, step 15.9
as of April 6):


Let id be a user-agent-defined undereferenceable yet globally unique
valid absolute URL. The same absolute URL should be generated for each
run of this algorithm when given the same input. Let has-alternate be

Problem #1: RFC 4287 does not require the ID to be undereferenceable.
This was a conscious decision of the IETF AtomPub WG. There's absolutely
no point in adding this requirement, except for the spec author's
distaste for URIs that are both dereferenceable *and* act as a globally
unique and stable identifier.

Note from RFC 4287:

"...Though the IRI might use a dereferencable scheme, Atom Processors
MUST NOT assume it can be dereferenced."
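
To illustrate what this means for a consumer, here is a minimal Python
sketch. It assumes the character-by-character, case-sensitive comparison
that RFC 4287 describes for atom:id; the function name same_atom_id is
made up for this example and comes from neither spec.

```python
def same_atom_id(id_a: str, id_b: str) -> bool:
    """Compare two atom:id values as opaque strings.

    Hypothetical helper: no network access and no URL normalization,
    so an id that happens to be an http URL is never dereferenced.
    """
    return id_a == id_b


# An id may use a dereferenceable scheme; it is still just an identifier.
feed_id = "http://example.org/blog/feed"
print(same_atom_id(feed_id, "http://example.org/blog/feed"))
print(same_atom_id(feed_id, "http://EXAMPLE.org/blog/feed"))
```

The comparison never needs the URL to resolve, so whether the id is
dereferenceable makes no difference to the consumer.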

Problem #2: RFC 4287 makes it a MUST-level requirement to generate the
same ID every time the feed is regenerated:


"When an Atom Document is relocated, migrated, syndicated, republished,
exported, or imported, the content of its atom:id element MUST NOT
change. Put another way, an atom:id element pertains to all
instantiations of a particular Atom entry or feed; revisions retain the
same content in their atom:id elements. It is suggested that the atom:id
element be stored along with the associated resource."

HTML5 relaxes this to a should-level requirement.

I do agree that generating valid Atom feeds from HTML *is* hard, but
violating a MUST-level requirement from the Atom spec is not acceptable.

Proposed changes:

For issue #1:

Leave out "undereferenceable", changing the sentence to:

"Let id be a user-agent-defined yet globally unique valid absolute URL."

For issue #2:

Replace

"The same absolute URL should be generated for each run of this
algorithm when given the same input."

with

"The same absolute URL must be generated for each run of this algorithm
when given the same input. If this requirement can not be fulfilled,
then generating a valid Atom feed is not possible and this algorithm
should be aborted."
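
One way a user agent could meet the proposed wording for issue #2 is
sketched below in Python. This is only an illustration under stated
assumptions: that the document's address is available and stable, and
that a name-based "urn:uuid:" value is acceptable as the id. The names
stable_feed_id and AtomGenerationAborted are hypothetical and appear in
neither spec.

```python
import uuid
from typing import Optional


class AtomGenerationAborted(Exception):
    """Hypothetical signal that the algorithm has to be aborted."""


def stable_feed_id(document_url: Optional[str]) -> str:
    """Derive the same id on every run for the same input.

    Assumption: the document's address serves as the stable input.
    """
    if not document_url:
        # No stable input, so a repeatable id cannot be guaranteed.
        raise AtomGenerationAborted("cannot derive a stable atom:id")
    # uuid5 is name-based: the same namespace and name always give
    # the same UUID, with no stored state between runs.
    return "urn:uuid:" + str(uuid.uuid5(uuid.NAMESPACE_URL, document_url))


first = stable_feed_id("http://example.org/blog/")
second = stable_feed_id("http://example.org/blog/")
print(first == second)
```

Since dereferenceable ids would be allowed after the change for issue
#1, a generator could also use the stable document address itself; the
hash-based variant is just one option.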


1. Positive Effects

Consistency between the applicable specs. Also, authors are correctly
informed about what it takes to generate proper Atom feeds.

2. Negative Effects


3. Conformance Classes Changes

Atom feed generators are actually required to generate valid Atom
documents (with respect to atom:id).

4. Risks