
Re: Slug header encoding



Roy T. Fielding wrote:

On Apr 13, 2007, at 11:00 AM, John Panzer wrote:
Whether or not RFC 2616 is less than clear on this point (I personally find it clear as mud), there's no ambiguity about what happens when you send an "X-Foo: eà" header to Apache running mod_jk in front of a Tomcat servlet container: it passes the data correctly if you use the ISO-8859-1 encoding, and it corrupts the data if you use the UTF-8 encoding. At least in our tests. (Note that this happens before the data leaves the Apache process, so there's not even an opportunity to fix this at the servlet container level.)

Hmm, interesting... eà in UTF-8 would be %C3%A0, so the problem could
be either that something is counting "characters" instead of bytes
(causing the string to be truncated) or is removing spaces using an
algorithm that is only 7bit-clean (0xA0 & 0x7F = 0x20 or space).
That is, assuming we accept the premise that UTF-8 is valid within
HTTP header fields, which is false, but we generally try not to lose
data in Apache regardless of the standard (for robustness).
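
To make the second hypothesis concrete, here is a minimal Java sketch of what a routine that is only 7-bit clean would do to the UTF-8 bytes of à. The class and method names are made up for illustration; nothing here is taken from Apache or mod_jk.

```java
import java.nio.charset.StandardCharsets;

public class SevenBitMask {
    // Hypothetical helper: clears the high bit of every byte, as a
    // routine that is not 8-bit clean might do.
    public static byte[] maskTo7Bits(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] & 0x7F);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+00E0 (a with grave) encodes to the two bytes C3 A0 in UTF-8.
        byte[] utf8 = "\u00e0".getBytes(StandardCharsets.UTF_8);
        for (byte b : maskTo7Bits(utf8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

After masking, 0xC3 becomes 0x43 ("C") and 0xA0 becomes 0x20 (space), so a later whitespace-trimming step would see a trailing space where the second byte used to be.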

Have you tested it with a different two-byte UTF-8 character?

Sorry, my info was slightly faulty. It actually goes through Apache and mod_jk fine; the trouble starts when it's converted to UCS-2 by the JVM. The actual problem comes at the Tomcat servlet library level, when it reads the UTF-8 bytes and (presumably) assumes that they are Latin-1. This produces two garbage characters in place of the single accented character in the resulting Unicode string.

Here's our actual test, where we just pull the contents of the header using the servlet container API and dump the results:

From command line with terminal set to Latin-1 encoding:
curl 'http://example.org/test.jsp' --header "x-foo: é"
-->mod_jk log: E9  (the Latin-1 encoding for é is E9)
-->tomcat log: é

From command line with terminal set to UTF-8, or run iconv on this line to convert and then source it:
curl 'http://example.org/test.jsp' --header "x-foo: é"
-->mod_jk log:  C3 A9  (the UTF-8 encoding for é is C3 A9)
-->tomcat log: Ã ©
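
The two results above can be reproduced outside the container with a few lines of Java. This is only a sketch of the presumed behaviour (decoding header bytes as ISO-8859-1); the class and method names are invented for illustration and are not Tomcat code.

```java
import java.nio.charset.StandardCharsets;

public class HeaderMojibake {
    // Hypothetical stand-in for what the container presumably does:
    // treat every header byte as one Latin-1 character.
    public static String decodeAsLatin1(byte[] headerBytes) {
        return new String(headerBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] latin1Bytes = {(byte) 0xE9};             // e-acute in Latin-1
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA9};  // e-acute in UTF-8

        String fromLatin1 = decodeAsLatin1(latin1Bytes);
        String fromUtf8 = decodeAsLatin1(utf8Bytes);

        // One character (U+00E9) in the first case, two characters
        // (U+00C3, U+00A9) in the second.
        System.out.println(fromLatin1.length());
        System.out.println(fromUtf8.length());
    }
}
```

The Latin-1 input yields the single character U+00E9, while the UTF-8 input yields the pair U+00C3 U+00A9, which matches the two garbage characters seen in the second test.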

There doesn't appear to be a way to get access to the raw bytes of the headers by the time our code starts to run... and if there is, it's certainly not convenient :).
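
One possible workaround, sketched here under the assumption that the container really does decode header bytes one-for-one as ISO-8859-1 (which the test output suggests but does not prove): since that decoding loses no information, the original bytes can be recovered from the string and decoded again as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class HeaderRecovery {
    // Assumes headerValue was produced by decoding the raw header
    // bytes as ISO-8859-1. Re-encoding with the same charset gives
    // the raw bytes back, which are then decoded as UTF-8.
    public static String reinterpretAsUtf8(String headerValue) {
        byte[] raw = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the servlet sees when the client sent C3 A9.
        String garbled = "\u00c3\u00a9";
        String fixed = reinterpretAsUtf8(garbled);
        System.out.println(fixed.length());
        System.out.println(fixed.equals("\u00e9"));
    }
}
```

This only helps when the sender is known to use UTF-8. A header that was genuinely sent as Latin-1 (the single byte E9) is not valid UTF-8, so the same call would turn it into the replacement character U+FFFD.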

--
John Panzer
System Architect
http://abstractioneer.org