Re: Resources for AtomPub parser validation
On Mon, Jun 30, 2008 at 10:30 PM, Daniel Jalkut wrote:
>
> On Jun 30, 2008, at 1:16 PM, Tim Bray wrote:
>
>> I'm a little concerned that this is bothering people who I know are smart,
>> experienced implementors. Maybe you've stumbled on a corner case?
>
> I think you might be overestimating both my smarts and experience ;) Before
> taking over with MarsEdit I had little real world experience with XML, so I
> find myself tackling many of "the finer points" without the aid of any
> historical perspective. I've only recently started getting more
> fine-toothed because I'm simultaneously changing parsers and getting more
> serious about the generic AtomPub support in MarsEdit.
>
> I find "HTML in an XHTML in an XML" to be inherently confusing.
It's not "HTML in an XHTML": either it's HTML or it's XHTML.
If it's HTML, it's not XML, so an XML parser/writer should treat it as
plain text: you either put it "as is" within a <![CDATA[ section
(beware of any ]]> contained in that HTML snippet!) or you "XML-escape"
the less-than and ampersand characters.
HTML: <p>This is HTML<bR>with omitted optional end tags for &lt;p>
&amp; mixed cased &lt;bR>
HTML in XML (1): <![CDATA[<p>This is HTML<bR>with omitted optional end
tags for &lt;p> &amp; mixed cased &lt;bR>]]>
HTML in XML (2): &lt;p>This is HTML&lt;bR>with omitted optional end
tags for &amp;lt;p> &amp;amp; mixed cased &amp;lt;bR>
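
The two options above can be sketched in Python. This is only an illustration under stated assumptions: the sample snippet, the helper names (as_cdata, as_escaped, round_trip) and the bare <content> wrapper are made up for the example, not taken from any AtomPub library. Both forms should give an XML parser the same text back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical HTML snippet (not from the thread): mixed-case tag, no </p>.
html = "<p>Fish &amp; chips<bR>second line"

def as_cdata(text):
    # "]]>" would end the section early, so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def as_escaped(text):
    # escape() rewrites &, < and > (escaping > is optional but harmless).
    return escape(text)

def round_trip(payload):
    # Wrap the payload in a made-up <content> element and read the text back.
    return ET.fromstring('<content type="html">' + payload + "</content>").text

print(as_cdata(html))
print(as_escaped(html))
```

Note that the already-escaped &amp; in the HTML gets escaped a second time in option (2); the XML parser undoes exactly one level, so the HTML consumer still sees &amp;.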
If it's XHTML, it's XML, so you just put it "as is" in your entry
(beware of the namespace declarations!)
XHTML: <p>This is XHTML<br />and has nothing special</p>
XHTML is XML, so: <p xmlns="http://www.w3.org/1999/xhtml">This is
XHTML<br/>and has nothing special</p>
(note how the <br /> was turned into <br/>, without the space)
It's not any different than a backslash or double-quote in a
C/Java/[put your language here] string: \ is a special character (as &
in XML and HTML) so you have to write it \\. Similarly, " is the
string delimiter so it has to be escaped as \".
And if you start writing regexps as strings: \ is a special char
in regexp syntax, so it has to be escaped as \\; if you want to write
this as a C/Java/... string, you have to escape each backslash once
more, resulting in \\\\. If I want to match a "(" in a regexp, I have
to escape it, so in a C/Java/... string I have to write "\\(", not
just "\(", because otherwise the C/Java/... compiler would "eat" the
backslash as an unnecessary escape (or even throw an error).
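
The same layering can be checked in Python, used here only as a stand-in for C/Java (an assumption of this sketch; Python's handling of an unknown escape such as "\(" differs from a C or Java compiler's, so that form is left out). The variable names are made up for the example.

```python
import re

# String-literal layer: "\\(" is the two characters \ and (.
paren_pattern = "\\("
# Raw strings skip the string-literal layer, so this is the same pattern.
assert paren_pattern == r"\("

# Regexp layer: \( matches a literal "(".
found = re.search(paren_pattern, "f(x)")

# Four backslashes in source -> two in the string -> the regexp \\,
# which matches ONE literal backslash.
backslash_pattern = "\\\\"
one_backslash = "\\"
matched = re.fullmatch(backslash_pattern, one_backslash)
```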
> But it's
> probably made worse by my lack of conviction about what format the content
> "is in" when it's being edited by my users. While MarsEdit is "mostly an
> HTML editor," it is also sort of an agnostic text editor. So ... to boil
> down the points of confusion, here's a specific example being cited by a
> customer whose AtomPub implementation I'm testing MarsEdit against. This is
> what comes over the wire:
>
> <content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"
> xml:space="preserve">
> Some test HTML with an excaped ampersand in it:
> <img src="foo?bar=1&amp;baz=2"/>
> </div></content>
>
> What's happening is the editor is ending up showing that &amp; as just a
> "&". This sounds like it's right, based on what you're saying that the
> xhtml content gets unescaped by the XML parser, right?
>
> But the customer's contention is that the &amp; should remain escaped in the
> HTML source, because that's how he typed it, and that's how it exists in the
> database on his server.
The "HTML" source should be a re-serialization of the infoset
generated by parsing the above snippet; so it should be &amp; and thus
there's a bug in your editor.
http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2
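
A rough sketch of that parse-then-re-serialize round trip, using Python's xml.etree.ElementTree purely as an example parser (an assumption; it is not what MarsEdit uses). The wire string reproduces the quoted entry with the ampersand escaped as it would travel over the wire.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# The entry content as sent over the wire (ampersand escaped as &amp;).
wire = ('<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml" '
        'xml:space="preserve">\n'
        'Some test HTML with an excaped ampersand in it:\n'
        '<img src="foo?bar=1&amp;baz=2"/>\n'
        '</div></content>')

root = ET.fromstring(wire)

# In the parsed tree the attribute value holds a plain "&".
img = root.find(".//{%s}img" % XHTML)
parsed_src = img.get("src")

# Serialize the div again to get the "source" view; the writer
# re-escapes the ampersand.
ET.register_namespace("", XHTML)
div = root.find("{%s}div" % XHTML)
source = ET.tostring(div, encoding="unicode")
print(source)
```

So the parsed value is what the HTML renderer works with, while anything shown to the user as source should come from the serializer, not from the raw attribute value.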
--
Thomas Broyer