RE: Well-formedness statistics
Some questions about your methodology.
PITHY WORDS OF WISDOM
There are always two solutions to the problem: yours and the boss's.
From: owner-atom-syntax@xxxxxxxxxxxx on behalf of Mark Pilgrim
Sent: Wed 6/23/2004 5:55 AM
Subject: Well-formedness statistics
I analyzed 5096 RSS and Atom feeds chosen at random from Syndic8.com
and parsed them with Universal Feed Parser 3.0.1 using the latest
version of libxml2 as the underlying XML parser.
Actually, I analyzed more feeds than that, but I threw away feeds that
- neither returned an HTTP status code 200 nor redirected to a URL
that returned 200, or
- didn't have a recognizable root-level element of some version of RSS or Atom
3929 feeds (77.10%) were well-formed.
961 feeds (18.86%) were not well-formed due to specifying
"Content-Type: text/xml" but containing non-us-ascii characters.
206 feeds (4.04%) were not well-formed for other reasons.
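The percentages above follow directly from the three counts. A small sketch to check the arithmetic; the dictionary labels and the helper name `percentages` are mine, not part of the original analysis:

```python
# Counts copied from the figures quoted above.
counts = {
    "well-formed": 3929,
    "text/xml with non-us-ascii characters": 961,
    "other reasons": 206,
}


def percentages(counts):
    """Return each count as a percentage of the total, rounded to two places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


total = sum(counts.values())
shares = percentages(counts)
for label, share in shares.items():
    print(f"{counts[label]} of {total} feeds ({share:.2f}%): {label}")
```

The three counts add up to the 5096 feeds in the sample, so no feed is counted twice or left out.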
As you can see, the main reason feeds fail to be well-formed is
specifying a Content-Type of "text/xml" with no charset parameter, but
not actually being us-ascii. Example:
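As a rough illustration of the rule described above (this is not the code used for the survey), the check can be sketched in Python. It assumes the RFC 3023 behavior that a "text/" media type served without a charset parameter is treated as us-ascii. The function names `effective_charset` and `violates_us_ascii_default` are hypothetical.

```python
from email.message import Message


def effective_charset(content_type):
    """Charset a strict consumer would assume from the HTTP header alone.

    Returns the explicit charset parameter if present, "us-ascii" for a
    text/* type without one (assumed RFC 3023 default), and None when the
    header does not decide the encoding.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return str(charset).lower()
    if msg.get_content_maintype() == "text":
        return "us-ascii"
    return None


def violates_us_ascii_default(content_type, body):
    """True if the body has bytes outside us-ascii while the header implies us-ascii."""
    if effective_charset(content_type) != "us-ascii":
        return False
    return any(byte > 0x7F for byte in body)


# Hypothetical feed body containing one non-ASCII character (e with acute accent).
sample_body = "<rss><title>caf\u00e9</title></rss>".encode("utf-8")
print(violates_us_ascii_default("text/xml", sample_body))                 # header implies us-ascii
print(violates_us_ascii_default("text/xml; charset=utf-8", sample_body))  # charset declared
```

Note that the encoding declared inside the XML declaration is deliberately ignored here, because under this rule the HTTP header takes precedence for "text/" types.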
The other 206 non-well-formed feeds suffer from a variety of problems:
- Unescaped HTML entities like &copy; in feeds that are not Netscape
RSS 0.91. (Note that Netscape RSS 0.91 includes a DTD which allows
these entities, so they were only counted as non-well-formed if they
occurred in feed formats other than Netscape RSS 0.91.) Example:
- Extra content at the end of the document. This is apparently
sometimes caused by scripts and other flotsam auto-inserted by the
hosting provider. Example: http://smogzer.tripod.com/smog.rdf
- Malformed XML declarations, such as a leading space. Example:
- Unescaped HTML markup in description. Example:
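The failure classes listed above can be reproduced with any conforming XML parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (backed by expat) rather than the libxml2 setup used for the survey; the sample documents and the helper name `well_formedness_error` are made up for illustration:

```python
import xml.etree.ElementTree as ET


def well_formedness_error(document):
    """Return the parser's error message, or None if the document parses."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical miniature feeds, one per failure class from the list above.
SAMPLES = {
    "well-formed": b'<?xml version="1.0"?><rss><title>ok</title></rss>',
    "undefined entity": b"<rss><title>&copy; 2004</title></rss>",
    "extra content at end": b"<rss></rss><script>ad()</script>",
    "leading space before declaration": b' <?xml version="1.0"?><rss/>',
    "unescaped HTML markup": b"<rss><description><p>one<br>two</description></rss>",
}

for name, document in SAMPLES.items():
    error = well_formedness_error(document)
    print(f"{name}: {'well-formed' if error is None else error}")
```

The exact error wording depends on the parser, so only the pass/fail result should be compared across tools. Unescaped markup fails only when it is not itself balanced XML, as with the unclosed `<br>` in the last sample.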
Postscript. A "text/" type does not automatically make a feed
non-well-formed. Virtually all of the feeds I analyzed were declared
with some "text/" type. 3452 feeds were declared as "text/xml", and
1064 were declared as "text/plain" or some other "text/" type. All of
these feeds were parsed as us-ascii, and the vast majority of them
actually were us-ascii.