Fwd: Default of the charset parameter
In the W3C XML SIG, Kurt Conrad and I wrote this summary for the
discussion of XML media types.
Kurt Conrad wrote...
Proposal:
The default of the charset parameter of text/xml and
application/xml is UTF-8 rather than US-ASCII (RFC 2045) or
ISO-8859-1 (RFC 2068 [HTTP/1.1]).
Criteria:
The default of this parameter is an interesting issue.
There are conflicting RFCs and a recommendation.
In RFC 2046 (MIME: Media types), the default is US-ASCII.
> 4.1.2. Charset Parameter [RFC 2046]
[snip]
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive. The default character set, which
> must be assumed in the absence of a charset parameter, is US-ASCII.
In RFC 2068 (HTTP/1.1), the default is ISO-8859-1.
>3.7.1 Canonicalization and Text Defaults [RFC 2068]
[snip]
> The "charset" parameter is used with some media types to define the
> character set (section 3.4) of the data. When no explicit charset
> parameter is provided by the sender, media subtypes of the "text"
> type are defined to have a default charset value of "ISO-8859-1" when
> received via HTTP. Data in character sets other than "ISO-8859-1" or
> its subsets MUST be labeled with an appropriate charset value.
HTML 4.0 further overrides this decision.
>5.2.2 Specifying the character encoding [HTML 4.0]
[snip]
>The HTTP protocol ([RFC2068], section 3.7.1) mentions
>ISO-8859-1 as a default character encoding when the
>"charset" parameter is absent from the "Content-Type" header
>field. In practice, this recommendation has proved useless
>because some servers don't allow a "charset" parameter to be
>sent, and others may not be configured to send the
>parameter. Therefore, user agents must not assume any
>default value for the "charset" parameter.
>
>To address server or configuration limitations, HTML
>documents may include explicit information about the
>document's character encoding; the META element can be used
>to provide user agents with this information.
>
>For example, to specify that the character encoding of the
>current document is "EUC-JP", a document should include the
>following META declaration:
>
><META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
>
>The META declaration must only be used when
>the character encoding is organized such that ASCII
>characters stand for themselves (at least until the META
>element is parsed). META declarations should appear as early
>as possible in the HEAD element.
>
>For cases where neither the HTTP protocol nor the META
>element provides information about the character encoding of
>a document, HTML also provides the charset attribute on
>several elements. By combining these mechanisms, an author
>can greatly improve the chances that, when the user
>retrieves a resource, the user agent will recognize the
>character encoding.
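The META mechanism quoted above can be illustrated with a minimal sketch. The function name sniff_meta_charset and the regular expression are assumptions made for illustration; they are not taken from HTML 4.0 or from any user agent, and a real user agent would parse the HEAD element properly. The sketch scans raw bytes, which only works because the encoding leaves ASCII characters standing for themselves, as the quoted text requires.

```python
import re
from typing import Optional

# Hypothetical pattern: looks for charset=... inside the content attribute
# of a META element. Works on bytes, so it relies on the document's
# encoding being ASCII-compatible up to the META element.
META_CHARSET = re.compile(
    rb'<meta[^>]+content\s*=\s*["\'][^"\']*charset\s*=\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)

def sniff_meta_charset(head: bytes) -> Optional[str]:
    """Return the charset named in a META declaration, or None if absent."""
    match = META_CHARSET.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

declaration = b'<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
print(sniff_meta_charset(declaration))
print(sniff_meta_charset(b"<HEAD><TITLE>no declaration</TITLE></HEAD>"))
```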
RFC 2130 (The Report of the IAB Character Set Workshop)
provides a guideline for the use of character sets on the
Internet. RFC 2130 recommends UTF-8 as the default for new
protocols.
>0: Executive summary [RFC 2130]
> This report recommends the use of ISO 10646 as the default Coded
> Character Set, and UTF-8 as the default Character Encoding Scheme in
> the creation of new protocols or new version of old protocols which
> transmit text. These defaults do not deprecate the use of other
> character sets when and where they are needed; they are simply
> intended to provide guidance and a specification for interoperability.
Since XML is a new application on the Internet, the best
default is UTF-8, as recommended by RFC 2130. There is no
need to change existing HTTP/1.1 Web servers. There is no
need to consider backward compatibility of already installed
XML documents. We can start from scratch.
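To make the competing defaults concrete, here is a minimal sketch of how a receiver might pick the charset under this proposal. The function name effective_charset and the via_http flag are assumptions made for illustration; they do not come from any of the cited specifications. Header parsing uses Python's standard email.message.Message.

```python
from email.message import Message
from typing import Optional

# Media types covered by the proposal.
XML_TYPES = {"text/xml", "application/xml"}

def effective_charset(content_type: str, via_http: bool = True) -> Optional[str]:
    """Charset a receiver would assume for a given Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_param("charset")
    if explicit:
        # An explicit charset parameter always wins; values are not case sensitive.
        return str(explicit).lower()
    if msg.get_content_type() in XML_TYPES:
        # Proposed default for text/xml and application/xml.
        return "utf-8"
    if msg.get_content_maintype() == "text":
        # RFC 2068 default over HTTP, RFC 2046 default otherwise.
        return "iso-8859-1" if via_http else "us-ascii"
    # No default is discussed here for other media types.
    return None

print(effective_charset("text/xml"))
print(effective_charset("text/xml; charset=Shift_JIS"))
print(effective_charset("text/plain"))
print(effective_charset("text/plain", via_http=False))
```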
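One potential drawback is fallback to text/plain. A user agent
that does not recognize text/xml may treat it as text/plain.
Since the default of HTTP/1.1 is ISO-8859-1, such a user agent
might decode UTF-8 data as ISO-8859-1 and corrupt the non-ASCII
characters. However, we do not think that this is a major problem.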
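The fallback problem can be shown with a small sketch, assuming a receiver that applies the HTTP/1.1 default to an unlabeled UTF-8 document. The sample document is invented for illustration.

```python
# Hypothetical XML fragment containing one non-ASCII character (e with acute).
xml_source = "<name>Caf\u00e9</name>"

# The sender serializes the document as UTF-8 and omits the charset parameter.
payload = xml_source.encode("utf-8")

# A receiver applying the proposed XML default recovers the text.
as_xml = payload.decode("utf-8")

# A receiver that falls back to text/plain and assumes ISO-8859-1 does not:
# each byte of the two-byte UTF-8 sequence becomes a separate character.
as_plain = payload.decode("iso-8859-1")

print(as_xml == xml_source)
print(as_plain == xml_source)
print(len(as_xml), len(as_plain))
```

The ASCII characters, including the markup, survive the fallback unchanged; only the characters outside ASCII are affected.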
References:
HTML 4.0 Specification
http://www.w3.org/TR/REC-html40/
RFC 2130
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2130.txt
RFC 2045
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2045.txt
RFC 2068
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2068.txt
----
MURATA Makoto muraw3c@xxxxxxxxxxxxx