
RE: Well-formedness statistics



Some questions about your methodology.
 
So you count a document served as text/plain as well-formed XML? I seem to remember reading various rants about how Internet Explorer is evil because it infers the type of documents served as text/plain instead of trusting their MIME type, which supposedly leads to some security issues. So what if the MIME type was text/css, text/javascript, or application/octet-stream? Did you also count those as well-formed XML as long as they were served as us-ascii?
 
 
-- 
PITHY WORDS OF WISDOM
There are always two solutions to the problem: yours and the boss's. 

________________________________

From: owner-atom-syntax@xxxxxxxxxxxx on behalf of Mark Pilgrim
Sent: Wed 6/23/2004 5:55 AM
To: atom-syntax@xxxxxxx
Subject: Well-formedness statistics




I analyzed 5096 RSS and Atom feeds chosen at random from Syndic8.com
and parsed them with Universal Feed Parser 3.0.1 using the latest
version of libxml2 as the underlying XML parser.

Actually, I analyzed more feeds than that, but I threw away feeds that
- didn't either return an HTTP status code 200 or redirect to a URL
that returned 200, or
- didn't have a recognizable root-level element of some version of RSS or Atom

3929 feeds (77.10%) were well-formed.
961 feeds (18.86%) were not well-formed due to specifying
"Content-Type: text/xml" but containing non-us-ascii characters.
206 feeds (4.04%) were not well-formed for other reasons.
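
As a quick arithmetic check, the three counts above can be summed and converted to percentages. This is only a sketch over the figures quoted in this message; the dictionary keys are labels made up for illustration.

```python
# Counts quoted above; the key names are illustrative labels only.
total = 5096
counts = {
    "well-formed": 3929,
    "text/xml but not us-ascii": 961,
    "other problems": 206,
}

# The three buckets should account for every analyzed feed.
assert sum(counts.values()) == total

# Percentage share of each bucket, rounded to two decimal places.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```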

As you can see, the main reason feeds fail to be well-formed is
specifying a Content-Type of "text/xml" with no charset parameter, but
not actually being us-ascii.  Example:
http://www.25hoursaday.com/rss10.xml
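
The rule behind that failure can be sketched roughly as follows. This is a minimal illustration, not the code Universal Feed Parser uses: the function names effective_charset and decodes_cleanly are invented here, and the sketch assumes that any "text/" type without a charset parameter is read as us-ascii, while non-"text/" types are left undecided at the HTTP level.

```python
from email.message import Message


def effective_charset(content_type):
    """Return the charset a strict consumer would assume from the header.

    Hypothetical helper: "text/*" with no charset parameter is treated
    as us-ascii; for other types the HTTP header alone decides nothing.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return str(charset).lower()
    if msg.get_content_type().startswith("text/"):
        return "us-ascii"
    return None


def decodes_cleanly(body, content_type):
    """Check whether the raw bytes are valid in the assumed charset."""
    charset = effective_charset(content_type)
    if charset is None:
        return True  # not decided by the header in this sketch
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


feed = "<title>caf\u00e9</title>".encode("utf-8")
print(decodes_cleanly(feed, "text/xml"))                 # non-ascii bytes, no charset
print(decodes_cleanly(feed, "text/xml; charset=utf-8"))  # charset declared
```

Under these assumptions the same bytes fail or pass depending only on the header, which is why adding a charset parameter is enough to fix most of the 961 feeds.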

The other 206 non-well-formed feeds suffer from a variety of problems:

- Unescaped HTML entities like &copy; in feeds that are not Netscape
RSS 0.91.  (Note that Netscape RSS 0.91 includes a DTD which allows
these entities, so they were only counted as non-well-formed if they
occurred in feed formats other than Netscape RSS 0.91.)  Example:
http://www.sporks-r-us.com/backend.rdf

- Extra content at the end of the document.  This is apparently
sometimes caused by scripts and other flotsam auto-inserted by the
hosting provider.  Example: http://smogzer.tripod.com/smog.rdf

- Malformed XML declarations, such as a leading space.  Example:
http://www.negroplease.com/index.rdf

- Unescaped HTML markup in description.  Example:
http://squishy.goop.org/index.rdf
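
Each of the four problems above can be reproduced with a tiny hand-written document. The sketch below uses Python's bundled expat bindings as a stand-in for libxml2, so the error messages will differ from what libxml2 reports; the sample documents and the well_formed helper are made up for illustration and are not taken from the example feeds.

```python
import xml.parsers.expat


def well_formed(data):
    """Return (ok, message) for a byte string parsed as a whole document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        return False, str(exc)
    return True, ""


# Hypothetical samples, one per category in the list above.
samples = {
    "undeclared entity": b"<rss><title>&copy; 2004</title></rss>",
    "trailing content": b"<rss></rss><script>x</script>",
    "leading space": b" <?xml version='1.0'?><rss/>",
    "unescaped markup": b"<rss><description><p>a<br>b</p></description></rss>",
    "control case": b"<rss><title>ok</title></rss>",
}

results = {name: well_formed(doc) for name, doc in samples.items()}
for name, (ok, message) in results.items():
    print(f"{name}: {'well-formed' if ok else 'NOT well-formed'} {message}")
```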


Postscript.  A "text/" type does not automatically make a feed
non-well-formed.  Virtually all of the feeds I analyzed were declared
with some "text/" type.  3452 feeds were declared as "text/xml", and
1064 were declared as "text/plain" or some other "text/" type.  All of
these feeds were parsed as us-ascii, but the vast majority of them
actually were us-ascii.

--
Cheers,
-Mark