Roy T. Fielding wrote:
On Apr 13, 2007, at 11:00 AM, John Panzer wrote:
Whether or not RFC 2616 is less than clear
on this point (I personally find it clear as mud), there's no ambiguity
about what happens when you send an "X-Foo: eà" header to Apache
running mod_jk sending data to a Tomcat servlet container: It passes
the data correctly if you use the ISO-8859-1 encoding, and it corrupts
the data if you use a UTF-8 encoding. At least in our tests. (Note
that this happens before the data leaves the Apache process, so there's
not even an opportunity to fix this at the servlet container level.)
Hmm, interesting... à in UTF-8 would be %C3%A0, so the problem could
be either that something is counting "characters" instead of bytes
(causing the string to be truncated) or is removing spaces using an
algorithm that is only 7bit-clean (0xA0 & 0x7F = 0x20 or space).
That is, assuming we accept the premise that UTF-8 is valid within
HTTP header fields, which is false, but we generally try not to lose
data in Apache regardless of the standard (for robustness).
Have you tested it with a different two-byte UTF-8 character?
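
The two hypotheses above rest on a little byte arithmetic. A minimal sketch in Python checks it (Python is used only for illustration here; none of this is taken from the Apache or mod_jk code):

```python
# The UTF-8 encoding of "à" (U+00E0) is the two bytes C3 A0.
encoded = "\u00e0".encode("utf-8")
assert encoded == b"\xc3\xa0"

# One character but two bytes: a length computed in characters
# instead of bytes would come up one short and truncate the value.
char_count = len("\u00e0")
byte_count = len(encoded)

# Clearing the high bit of the second byte yields 0x20, an ASCII space,
# which whitespace-trimming code that is only 7-bit clean could then strip.
masked = encoded[1] & 0x7F

print(f"chars={char_count} bytes={byte_count} masked=0x{masked:02X}")
```

Either failure would show up for à specifically because its trailing byte is 0xA0; a character with a different trailing byte would only trigger the truncation case, which is presumably why testing a second two-byte character helps tell the two apart.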
Sorry, my info was slightly faulty. It actually goes through Apache
and mod_jk fine; the trouble starts when it's converted to UCS-2 by the
JVM. The actual problem comes at the Tomcat servlet library level, when
it reads the UTF-8 bytes and (presumably) assumes that they are
Latin-1. This produces two garbage characters for the à in the
resulting Unicode string.
Here's our actual test, where we just pull the contents of the header
using the servlet container API and dump the results:
From the command line with the terminal set to Latin-1 encoding:
curl 'http://example.org/test.jsp' \
--header "x-foo: é"
-->mod_jk log: E9 (the Latin-1 encoding for é is E9)
-->tomcat log: é
From the command line with the terminal set to UTF-8, or run iconv on this
line to convert and then source it:
curl 'http://example.org/test.jsp' \
--header "x-foo: é"
-->mod_jk log: C3 A9 (the UTF-8 encoding for é is C3 A9)
-->tomcat log: Ã ©
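
Both results are consistent with the header bytes being decoded as ISO-8859-1 no matter what the client sent. The following Python sketch reproduces that behaviour under that assumption; `decode_as_latin1` is a hypothetical stand-in for whatever the servlet library does internally, which is presumed rather than confirmed here:

```python
def decode_as_latin1(raw: bytes) -> str:
    """Decode header bytes the way the servlet layer is presumed to: as ISO-8859-1."""
    return raw.decode("iso-8859-1")


latin1_bytes = "\u00e9".encode("iso-8859-1")  # "é" as the single byte E9
utf8_bytes = "\u00e9".encode("utf-8")         # "é" as the two bytes C3 A9

# Print the wire bytes next to the string the application would see.
for raw in (latin1_bytes, utf8_bytes):
    print(raw.hex(" ").upper(), "->", decode_as_latin1(raw))
```

The Latin-1 input survives as a single character, while the UTF-8 input becomes two characters (U+00C3 followed by U+00A9), one per byte.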
There doesn't appear to be a way to get access to the raw bytes of the
headers by the time our code starts to run... and if there is, it's
certainly not convenient :).
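
One possible workaround, offered as a sketch rather than a tested fix: ISO-8859-1 maps every byte value to exactly one code point, so if the container really does decode with strict ISO-8859-1, the original bytes can be recovered by encoding the string back to ISO-8859-1 and then decoding them as UTF-8. The Python below illustrates the round trip; `recover_utf8_header` is a hypothetical helper name, and the fallback behaviour is an assumption about what an application would want.

```python
def recover_utf8_header(decoded: str) -> str:
    """Undo a presumed ISO-8859-1 decode and reinterpret the bytes as UTF-8.

    Returns the input unchanged when the recovered bytes are not valid
    UTF-8, which is how a genuine Latin-1 header such as E9 would look.
    """
    try:
        raw = decoded.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Code points above U+00FF cannot come from a Latin-1 decode.
        return decoded
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return decoded


print(recover_utf8_header("\u00c3\u00a9"))  # the two-character garbled form
print(recover_utf8_header("\u00e9"))        # a header that really was Latin-1
```

This is a heuristic: a Latin-1 value whose bytes happen to form valid UTF-8 would be reinterpreted, so it only makes sense where clients are expected to send UTF-8.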