
Re: well-formedness error




Just a quick thanks to everyone for the practical advice on form encoding etc.


Obviously the "yes it is/no it isn't" style of thread isn't particularly productive, but often citing a bit of theory doesn't help much either. What's been suggested on this issue, on the other hand, can be translated directly into code and/or specific advice for other developers. Yep, thanks.

Cheers,
Danny.

Martin Duerst wrote:


At 11:06 04/06/16 -0700, Walter Underwood wrote:


This is really a question about internationalization and HTML forms,
right? Fix it by making the unknown encoding a known encoding.

It is also a question about when you do verification. Verify the stuff
before you ever put it into the DB and send an error back to the user
right then. If you let garbage in, you'll send garbage out.


Yes. And UTF-8 is extremely helpful here, because it has very
easily detectable byte patterns. See
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
and the regular expression at
http://www.w3.org/International/questions/qa-forms-utf-8.html.
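
As a rough illustration of verifying input before it reaches the DB, here is a minimal Python sketch. It relies on Python's strict UTF-8 decoder rather than the regular expression at the URL above, and the function names is_valid_utf8 and accept_field are made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    Python's strict decoder rejects stray continuation bytes,
    overlong forms, and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def accept_field(raw: bytes) -> str:
    """Hypothetical gatekeeper: validate a submitted field before storing it."""
    if not is_valid_utf8(raw):
        # Report the problem to the user right away instead of storing garbage.
        raise ValueError("field is not valid UTF-8; please resubmit")
    return raw.decode("utf-8")
```

Text that was really sent as ISO 8859-1 almost always fails this check as soon as it contains a non-ASCII character, which is what makes the byte patterns so useful.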


I'll use UTF-8 here, but other encodings will do.


To some extent only. UTF-8 is the best choice in most cases.


1. Create the HTML page in UTF-8.

2. Declare it as UTF-8 with an http-equiv meta tag.
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">

3. Use the accept-charset attribute on the form:
  <form accept-charset="UTF-8" ...>

4. Optionally, do the same with language info:
  <meta http-equiv="content-language" content="en">
  <input type=hidden name=la value=en>


No. While the browser will make sure that the reply gets sent
back encoded in UTF-8, there is absolutely no way to make sure
that the language of the posting is 'en'. If you want language
information, you have to give the user a chance to select it.


5. If you allow different encodings on different pages, you will
   also need to pass in the encoding for this form:
  <input type=hidden name=charset value=UTF-8>


Much better: Always use UTF-8. It simplifies things considerably and
greatly reduces the chance of errors.
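
On the receiving side, always using UTF-8 means the form body can be decoded with one fixed encoding. A minimal Python sketch, assuming a URL-encoded body that is already available as a string (the function name parse_utf8_form and the field names are only illustrative):

```python
from urllib.parse import parse_qs


def parse_utf8_form(body: str) -> dict[str, list[str]]:
    """Decode a URL-encoded form body, treating every field as UTF-8.

    errors="strict" makes a field containing bytes that are not valid
    UTF-8 raise UnicodeDecodeError instead of being silently replaced.
    """
    return parse_qs(
        body,
        keep_blank_values=True,
        encoding="utf-8",
        errors="strict",
    )


# "caf%C3%A9" is the percent-encoded UTF-8 form of "café".
fields = parse_utf8_form("q=caf%C3%A9&la=en")
```

A body that was actually sent as ISO 8859-1, such as "q=caf%E9", raises an error here, so the problem shows up at submission time rather than in the stored data.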


Using UTF-8 avoids the ISO 8859-1 vs. code page 1252 issues.
Microsoft products are happy to use UTF-8. For example, posting
from MS Word into a UTF-8 MSIE page should work fine for anything
newer than Word 95 (the Japanese version was Shift JIS internally).
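
To make the ISO 8859-1 vs. code page 1252 issue concrete, here is a small Python sketch. The sample bytes are invented for the example; the two legacy encodings disagree only about the bytes 0x80 to 0x9F, which is where Windows puts its curly quotes.

```python
# Bytes as a Windows application might produce them for curly-quoted text.
raw = b"\x93quoted\x94"

# Read as code page 1252, 0x93 and 0x94 are the curly double quotes.
as_cp1252 = raw.decode("cp1252")

# Read as ISO 8859-1, the same bytes are invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# In UTF-8 the quotes have one unambiguous three-byte encoding each.
as_utf8 = "\u201cquoted\u201d".encode("utf-8")
```

With UTF-8 on both ends there is no such pair of near-identical encodings to confuse.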


The copy/paste actually happens with UTF-16, but nobody needs to
care about it. It's all just characters.


This really does work -- we've been doing it in Ultraseek for about
four years. "accept-charset" should default to the encoding of the
HTML document, but being explicit seems to help with some browsers.


Interesting. Can you tell me what these browsers are?


I've never seen this documented, which is part of the reason that
we mostly have an English Wide Web.


See http://www.w3.org/International/questions/qa-forms-utf-8.html
and http://www.w3.org/Talks/1999/0830-tutorial-unicode-mjd/.
The latter is one version of a tutorial that was given over several
years at the Unicode Conference.

Regards, Martin.




--

Raw
http://dannyayers.com