Re: Resources for AtomPub parser validation
On Jun 30, 2008, at 1:30 PM, Daniel Jalkut wrote:
I find "HTML in an XHTML in an XML" to be inherently confusing.
Well, you're in good company. Also, you're right, it *is* confusing.
I think that the section in the RFC entitled "Text constructs" is your
friend, and would reward a couple of careful re-readings.
This is what comes over the wire:
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"
xml:space="preserve">
Some test HTML with an excaped ampersand in it:
<img src="foo?bar=1&amp;baz=2"/>
</div></content>
What's happening is the editor is ending up showing that &amp; as
just a "&". This sounds like it's right, based on what you're
saying that the xhtml content gets unescaped by the XML parser, right?
But the customer's contention is that the &amp; should remain
escaped in the HTML source, because that's how he typed it, and
that's how it exists in the database on his server.
You're into judgment-call territory. The actual URL in that link is
"foo?bar=1&baz=2". No room for argument there. In an ideal world,
that's what the user would type and that's what the user would see.
Of course, in an XML document, it's going to be encoded as
"foo?bar=1&amp;baz=2" because "&" is a magic character in XML. In a
traditional bad-HTML-on-the-web doc, of which there are billions,
you'd quite likely just write it as
<img src="foo?bar=1&baz=2">
and the browser would be forgiving and figure out what you meant. But
if you did it the right way with &amp;, it would work too.
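
To make the difference between the two spellings concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The choice of Python and the helper name parse_src are assumptions for illustration only; nothing in the thread says which parser is in use. It shows that a strict XML parser hands the application the real URL when the ampersand is escaped, and rejects the sloppy form that a browser would tolerate.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The same element written the XML-correct way and the sloppy-HTML way.
well_formed = '<img xmlns="%s" src="foo?bar=1&amp;baz=2"/>' % XHTML_NS
sloppy = '<img xmlns="%s" src="foo?bar=1&baz=2"/>' % XHTML_NS


def parse_src(markup):
    """Return the src attribute as the application sees it after XML
    parsing, or None if the markup is not well-formed XML.
    (Hypothetical helper, not part of any Atom library.)"""
    try:
        return ET.fromstring(markup).get("src")
    except ET.ParseError:
        return None


# The parser unescapes &amp; so the program sees the actual URL.
print(parse_src(well_formed))
# A bare "&" is not well-formed XML, so a strict parser refuses it.
print(parse_src(sloppy))
```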
Now... what to stick in the Atom feed you're generating?
If you want to take responsibility for processing whatever the user
provides and guaranteeing that it's really XML, you can just leave it
the way it is:
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"
xml:space="preserve">
Some test HTML with an excaped ampersand in it:
<img src="foo?bar=1&amp;baz=2"/>
</div></content>
My guess is you probably don't want to do that. In which case you
have to re-escape the user's crappy code to hide it from the XML parser:
<content type="html">
Some test HTML with an excaped ampersand in it:
&lt;img src="foo?bar=1&amp;amp;baz=2"/&gt;
</content>
Ain't that hideous? Mind you, you can lose all the namespace cruft.
And who needs that <div> anyhow? Anyhow, you feed this to your
friendly local XML parser and it'll give your program the content back
as
Some test HTML with an excaped ampersand in it:
<img src="foo?bar=1&amp;baz=2"/>
Which you can hand to WebKit or whatever HTML engine you're using, and
everything will be fine. Mind you, as I noted above, you could
actually be sloppy about the "&" and things would still work.
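
The type="html" round trip described above can be sketched roughly as follows, again with Python's standard xml.etree.ElementTree. This is only an illustration under stated assumptions: the variable stored_html stands in for what the customer says is in his database (HTML source with the ampersand already escaped), and the function name to_atom_html_content is hypothetical. A real feed would also put the content element in the Atom namespace, which is omitted here to match the fragments in this message.

```python
import xml.etree.ElementTree as ET

# Assumed server-side HTML source: the ampersand in the URL is already
# HTML-escaped, as the customer says it is in his database.
stored_html = (
    "Some test HTML with an excaped ampersand in it:\n"
    '<img src="foo?bar=1&amp;baz=2"/>'
)


def to_atom_html_content(html_source):
    """Wrap HTML source in a type="html" content element.
    ElementTree escapes &, < and > in the text for us, which is the
    second layer of escaping that hides the markup from the XML parser.
    (Hypothetical helper for illustration.)"""
    content = ET.Element("content", {"type": "html"})
    content.text = html_source
    return ET.tostring(content, encoding="unicode")


# What goes over the wire: the doubly escaped form.
wire = to_atom_html_content(stored_html)
print(wire)

# What the receiving program gets back from its XML parser: the
# original HTML source, ready to hand to an HTML engine.
recovered = ET.fromstring(wire).text
print(recovered)
```

Whether the editor then shows the user the recovered source with its "&amp;" intact, or the rendered URL with a plain "&", is the UI question rather than a parsing one.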
But to return to your user's remark... that's a UI design question.
My intuition is that what's really there is just a "&" so that's what
the user should type and that's what the user should see. But then
I'm not a UI designer and I don't know your users. There's nothing
terribly wrong with allowing the user to see the escaping, I suppose.
Well, except for, it can get confusing, as we observe.

-T