Fwd: Default of the charset parameter



In the W3C XML SIG, Kurt Conrad and I wrote this summary for the 
discussion of XML media types.


Kurt Conrad wrote...
Proposal:

The default value of the charset parameter for text/xml and
application/xml should be UTF-8, rather than US-ASCII (RFC 2045)
or ISO-8859-1 (RFC 2068 [HTTP/1.1]).

Criteria:

The default of this parameter is not settled: two RFCs and a
W3C recommendation each say something different.

In RFC 2046 (MIME: Media types), the default is US-ASCII.

> 4.1.2.  Charset Parameter [RFC 2046]
[snip]
>   Unlike some other parameter values, the values of the charset
>   parameter are NOT case sensitive.  The default character set, which
>   must be assumed in the absence of a charset parameter, is US-ASCII.

In RFC 2068 (HTTP/1.1), the default is ISO-8859-1.

>3.7.1 Canonicalization and Text Defaults [RFC 2068]
[snip]
>   The "charset" parameter is used with some media types to define the
>   character set (section 3.4) of the data. When no explicit charset
>   parameter is provided by the sender, media subtypes of the "text"
>   type are defined to have a default charset value of "ISO-8859-1" when
>   received via HTTP. Data in character sets other than "ISO-8859-1" or
>   its subsets MUST be labeled with an appropriate charset value.

HTML 4.0 in turn overrides the HTTP/1.1 default.

>5.2.2 Specifying the character encoding  [HTML 4.0]
[snip]
>The HTTP protocol ([RFC2068], section 3.7.1) mentions
>ISO-8859-1 as a default character encoding when the
>"charset" parameter is absent from the "Content-Type" header
>field. In practice, this recommendation has proved useless
>because some servers don't allow a "charset" parameter to be
>sent, and others may not be configured to send the
>parameter. Therefore, user agents must not assume any
>default value for the "charset" parameter.
>
>To address server or configuration limitations, HTML
>documents may include explicit information about the
>document's character encoding; the META element can be used
>to provide user agents with this information.
>
>For example, to specify that the character encoding of the
>current document is "EUC-JP", a document should include the
>following META declaration:

><META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
>
>The META declaration must only be used when the character
>encoding is organized such that ASCII characters stand for
>themselves (at least until the META element is parsed). META
>declarations should appear as early as possible in the HEAD
>element.
>
>For cases where neither the HTTP protocol nor the META
>element provides information about the character encoding of
>a document, HTML also provides the charset attribute on
>several elements. By combining these mechanisms, an author
>can greatly improve the chances that, when the user
>retrieves a resource, the user agent will recognize the
>character encoding.
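
As a rough illustration of the META mechanism quoted above (this is
not part of the HTML 4.0 text), a user agent could scan the first
bytes of a document for a charset before decoding it. The function
name sniff_meta_charset, the 1024-byte limit, and the regular
expression are assumptions made for this sketch; real user agents
use more elaborate rules.

```python
import re

# Matches <META http-equiv="Content-Type" content="...; charset=NAME">.
# \x22 and \x27 are the double and single quote characters.
META_CHARSET = re.compile(
    rb"<meta[^>]*http-equiv=[\x22\x27]?content-type[\x22\x27]?"
    rb"[^>]*content=[\x22\x27][^\x22\x27]*charset=([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data, limit=1024):
    """Return the charset named in an early META declaration, or None.

    The scan works on raw bytes, so it relies on ASCII characters
    standing for themselves in the document's real encoding.
    """
    match = META_CHARSET.search(data[:limit])
    if match is None:
        return None
    # Charset names are not case sensitive; normalize to lower case.
    return match.group(1).decode("ascii").lower()
```

The byte-level scan is why the quoted text restricts META to
encodings in which ASCII characters stand for themselves.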

RFC 2130 (The Report of the IAB Character Set Workshop)
provides a guideline for the use of character sets on the
Internet.  RFC 2130 recommends UTF-8 as the default for new
protocols.

>0: Executive summary [RFC 2130]
>   This report recommends the use of ISO 10646 as the default Coded
>   Character Set, and UTF-8 as the default Character Encoding Scheme in
>   the creation of new protocols or new version of old protocols which
>   transmit text. These defaults do not deprecate the use of other
>   character sets when and where they are needed; they are simply
>   intended to provide guidance and a specification for interoperability.

Since XML is a new application on the Internet, the best
default is UTF-8, as recommended by RFC 2130.  There is no
need to change existing HTTP/1.1 Web servers, and there is no
installed base of XML documents whose backward compatibility
we would have to preserve.  We can start from scratch.
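
To make the proposed rule concrete, here is a minimal sketch of how
a receiver might pick the charset from a Content-Type value.  The
function name effective_charset and the choice to return None for
non-text types are assumptions for illustration only; the
ISO-8859-1 branch is the RFC 2068 default for other "text"
subtypes received via HTTP.

```python
from email.message import Message

# Media types covered by the proposal.
XML_TYPES = {"text/xml", "application/xml"}


def effective_charset(content_type):
    """Return the charset a receiver would assume for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins.  Charset names are
    # not case sensitive, so normalize to lower case.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()

    media_type = msg.get_content_type()
    if media_type in XML_TYPES:
        return "utf-8"        # default proposed in this message
    if media_type.startswith("text/"):
        return "iso-8859-1"   # RFC 2068 default for text/* over HTTP
    return None               # no default assumed in this sketch
```

For example, effective_charset("text/xml") gives "utf-8", while
effective_charset("text/xml; charset=EUC-JP") gives "euc-jp".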

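One potential drawback is fallback to text/plain.  A user agent
that does not recognize text/xml may treat the document as
text/plain and, since the HTTP/1.1 default is ISO-8859-1, decode
unlabeled UTF-8 data as ISO-8859-1.  Non-ASCII characters would
then be displayed incorrectly.  However, we do not think that
this is a major problem.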
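
The fallback problem can be reproduced in a few lines.  This is
only a sketch; the function name decode_as_fallback and the sample
string are made up for the illustration.

```python
def decode_as_fallback(data):
    """Decode bytes the way a text/plain fallback over HTTP/1.1 would."""
    # ISO-8859-1 maps every byte to a character, so this never fails;
    # it silently yields the wrong characters for UTF-8 input.
    return data.decode("iso-8859-1")


original = "caf\u00e9"            # "café", one non-ASCII character
wire = original.encode("utf-8")   # what an XML sender would transmit
shown = decode_as_fallback(wire)  # what the fallback would display

# The bytes survive intact, so re-decoding as UTF-8 recovers the text.
recovered = shown.encode("iso-8859-1").decode("utf-8")
```

Here shown differs from original (the single "é" becomes two
characters), but recovered equals original.  Pure ASCII text is not
affected, since the two encodings agree on ASCII.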


References:

HTML 4.0 Specification
   http://www.w3.org/TR/REC-html40/

RFC 2130
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2130.txt

RFC 2045
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2045.txt

RFC 2068
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2068.txt



----
MURATA Makoto  muraw3c@xxxxxxxxxxxxx