
Fwd: Determination of encoding/character set



In the W3C XML SIG, Kurt Conrad and I wrote this summary for the 
discussion of XML media types.


Kurt Conrad wrote...
Proposal:

The MIME types for XML documents (text/xml and application/xml)
take the charset parameter.  The encoding is determined by this
parameter alone.  All other information is used for error
recovery only.
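
As a rough illustration of the proposed precedence, here is a minimal
Python sketch.  The function names and the returned pair are
hypothetical, and the regular expression only handles an encoding
declaration in an ASCII-compatible encoding.  It also leaves open
what to do when no charset parameter is present.

```python
import re

def declared_encoding(body):
    """Return the encoding named in the XML declaration, or None.

    Only works when the declaration itself is ASCII-compatible.
    """
    m = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        body,
    )
    return m.group(1).decode("ascii").lower() if m else None

def select_encoding(header_charset, body):
    """Choose the encoding from the charset parameter only.

    The in-document label never overrides the header; it is compared
    solely so that a mismatch can be reported for error recovery.
    """
    if header_charset is None:
        return None, False  # not covered by this sketch
    charset = header_charset.lower()
    hint = declared_encoding(body)
    conflict = hint is not None and hint != charset
    return charset, conflict
```

A caller would decode the entity with the first value and treat a true
second value as a warning, not as a reason to switch encodings.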


Criteria:

This issue has been very controversial.  Which should determine
encoding, the charset parameter of the MIME header, or the encoding
PI (or BOM)?  

This issue is very closely related to the next issue
(text/xml and/or application/xml?), as the top-level type
"text" provides the charset parameter.

There are two relevant RFCs, namely RFC 2130 (The Report of the IAB
Character Set Workshop) and RFC 2046 (MIME Part Two: Media Types).

RFC 2130 provides a guideline for the use of character sets on the
Internet.  For XML to be a good citizen on the Internet, we have to
follow this guideline wherever possible.

RFC 2130 treats determination of the character encoding as a
protocol issue.  It clearly recommends the use of MIME
headers to determine the character encoding (Character Encoding
Scheme in the terminology of RFC 2130).

>3.3:  Determining which values of CCS, CES, and TES are used [RFC 2130]
>
>   To completely specify which CCS, CES, and TES are used in a specific
>   text transmission, there needs to be a consistent set of labels for
>   specifying which CCS, CES, and TES are used.  Once the appropriate
>   mechanisms have been selected, there are six techniques for attaching
>   these labels to the data.
>
>   The labels themselves are named and registered, either with IANA
>   [IANA] or with some other registry.  Ideally, their definitions are
>   retrievable from some registration authority.
>
>   Labels may be determined in one of the following ways:
>
>   -  Determined by guessing, where the receiver of the text has to
>      guess the values of the CCS, CES, and TES. For example: "I got
>      this from Sweden so it's probably  ISO-8859-1."  This is
>      obviously not a very foolproof way to decode text.
>   -  Determined by the standard, where the protocol used to transmit
>      the data has made documented choices of CCS, CES, and TES in the
>      standard. Thus, the encodings used are known through the
>      access protocol, for example HTTP [HTTP] uses (but is not
>      limited to) ISO-8859-1, SMTP uses US-ASCII.
>   -  Attached to the transfer envelope, where the descriptive labels are
>      attached to the wrapper placed around the text for transport.
>      MIME headers are a good example of this technique.
>   -  Included in the data stream, where the data stream itself has
>      been encoded in such a way as to signal the character set used.
>      For example, ISO-2022 encodes the data with escape sequences to
>      provide information on the character subset currently being used.
>   -  Agreed by prior bilateral agreement, where some out-of-band
>      negotiation has allowed the text transmitter and receiver to
>      determine the CCS, CES, and  TES for the transmitted text.
>   -  Agreed to by negotiation during some phase, typically
>      initialization of the protocol.
>
>3.3.1:  Recommendations for value specification mechanisms [RFC 2130]
>
>   While each of these techniques (with the  exception of guessing) is
>   useful in particular situations, interoperability requires a more
>   consistent set of techniques.  Thus, we recommend that MIME
>   registered values be used for all tagging of character sets and
>   languages UNLESS there is an existing mechanism for determining the
>   required information using one of the other techniques (except
>   guessing).  This recommendation will require a fair bit of work on
>   the part of protocol designers, implementors, the IETF, the IESG, and
>   the IAB.

The top-level media type "text" already provides the charset
parameter (RFC 2046).  Thus, if we use text/*, the encoding
should be determined by this parameter only.

>4.1.2.  Charset Parameter [RFC 2046]
>
>   A critical parameter that may be specified in the Content-Type field
>   for "text/plain" data is the character set.  This is specified with a
>   "charset" parameter, as in:
>
>     Content-type: text/plain; charset=iso-8859-1
>
>   Unlike some other parameter values, the values of the charset
>   parameter are NOT case sensitive.  The default character set, which
>   must be assumed in the absence of a charset parameter, is US-ASCII.
>
>   The specification for any future subtypes of "text" must specify
>   whether or not they will also utilize a "charset" parameter, and may
>   possibly restrict its values as well.  For other subtypes of "text"
>   than "text/plain", the semantics of the "charset" parameter should be
>   defined to be identical to those specified here for "text/plain",
>   i.e., the body consists entirely of characters in the given charset.
>   In particular, definers of future "text" subtypes should pay close
>   attention to the implications of multioctet character sets for their
>   subtype definitions.
>
>   The charset parameter for subtypes of "text" gives a name of a
>   character set, as "character set" is defined in RFC 2045.  The rules
>   regarding line breaks detailed in the previous section must also be
>   observed -- a character set whose definition does not conform to
>   these rules cannot be used in a MIME "text" subtype.
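
To make the quoted rules concrete, here is a minimal Python sketch
that reads the charset parameter from a Content-Type value.  It relies
on the standard library's email.message.Message; the helper name
text_charset is hypothetical.  The US-ASCII fallback follows the
default quoted above for "text/plain".

```python
from email.message import Message

def text_charset(content_type):
    """Return the charset parameter, lowercased, or us-ascii if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value, which matches the
    # rule that charset values are not case sensitive.
    return msg.get_content_charset(failobj="us-ascii")

print(text_charset("text/plain; charset=ISO-8859-1"))
print(text_charset("text/plain"))
```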

We have to use the top-level type "application" for
transmitting XML documents in UTF-16 or UCS-2 via the SMTP
protocol, because these encodings cannot satisfy the line
termination rule that MIME imposes on "text" (a line break
must be the CRLF octet sequence).  Even in this case, however,
RFC 2046 (4.1.2) suggests that the charset parameter may be used.

>   Other media types than subtypes of "text" might choose to employ the
>   charset parameter as defined here, but with the CRLF/line break
>   restriction removed.  Therefore, all character sets that conform to
>   the general definition of "character set" in RFC 2045 can be
>   registered for MIME use.
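
The line termination problem can be seen directly at the octet level.
The following Python sketch (the sample text is arbitrary) encodes the
same characters in UTF-8 and in big-endian UTF-16 and checks whether
the two-octet CRLF sequence appears in the result.

```python
text = "<doc>\r\n</doc>"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")

# In UTF-8, CR and LF are adjacent single octets (0D 0A).
crlf_in_utf8 = b"\r\n" in utf8

# In UTF-16, each of CR and LF occupies two octets (00 0D 00 0A),
# so the octet sequence 0D 0A never occurs.
crlf_in_utf16 = b"\r\n" in utf16

print(crlf_in_utf8, crlf_in_utf16)
```

A transport that looks for CRLF octets to find line breaks therefore
cannot handle such an entity as "text".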

HTML 4.0 already uses the charset parameter.

>5.2 Character encodings [HTML 4.0]
>
>What this specification calls a character encoding is known
>by different names in other specifications (which may cause
>some confusion). However, the concept is largely the same
>across the Internet. Also, protocol headers, attributes, and
>parameters referring to character encodings share the same
>name -- "charset" -- and use the same values from the [IANA]
>registry (see [CHARSETS] for a complete list).
>
>The "charset" parameter identifies a character encoding,
>which is a method of converting a sequence of bytes into a
>sequence of characters. This conversion fits naturally with
>the scheme of Web activity: servers send HTML documents to
>user agents as a stream of bytes; user agents interpret them
>as a sequence of characters. The conversion method can range
>from simple one-to-one correspondence to complex switching
>schemes or algorithms.
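
A small Python sketch shows why the label matters: the same octets
yield different characters depending on which charset the receiver
applies.  The sample character is chosen only for illustration.

```python
# One character, U+00E9 (e with acute), sent as UTF-8: octets C3 A9.
data = "\u00e9".encode("utf-8")

# Decoded with the correct label, the character is recovered.
as_utf8 = data.decode("utf-8")

# Decoded as ISO-8859-1, the two octets become two characters,
# U+00C3 and U+00A9.
as_latin1 = data.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))
```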


How do we specify the charset parameter?  HTML 4.0 
talks about server configuration.

>5.2.2 Specifying the character encoding [HTML 4.0]
>
>How does a server determine which character encoding applies
>for a document it serves? Some servers examine the first few
>bytes of the document, or check against a database of known
>files and encodings. Many modern servers give Web masters
>more control over charset configuration than old servers do. 
>Web masters should use these mechanisms to send out a
>"charset" parameter whenever possible, but should take care
>not to identify a document with the wrong "charset"
>parameter value.

It has been argued that casual users cannot set the charset
parameter.  However, the most popular WWW server, namely
Apache, allows casual users to set the charset parameter
easily.  A casual user only has to create a file named .htaccess
in his or her directory and add a line such as the following:

	AddType  'text/xml; charset=utf-8'    xml

(See http://www.apache.org/docs/mod/mod_mime.html#addtype).

Some WWW servers do not provide this feature (.htaccess),
but it is usually possible to use file extensions to specify
the charset parameter.  For example, the file extension
"xml8" specifies the charset parameter "utf-8" if the WWW
server configuration file has a line such as the following:

   type="text/xml; charset=utf-8" exts=xml8
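
The effect of such an extension mapping can be sketched in Python as
follows.  This is not the code of any particular server.  The table,
the "xml16" entry, and the function name content_type_for are
hypothetical and only illustrate how a file extension can select the
charset parameter.

```python
import os

# Hypothetical extension table; only the "xml8" entry comes from the
# example above.
CONTENT_TYPES = {
    ".xml8": "text/xml; charset=utf-8",
    ".xml16": "application/xml; charset=utf-16",
}

def content_type_for(filename, default="application/octet-stream"):
    """Return the Content-Type value selected by the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    return CONTENT_TYPES.get(ext, default)

print(content_type_for("report.xml8"))
```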


References:

Apache HTTP Server Version 1.3 / Module mod_mime / Directive AddType
   http://www.apache.org/docs/mod/mod_mime.html#addtype

HTML 4.0 Specification
   http://www.w3.org/TR/REC-html40/

RFC 2130
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2130.txt

RFC 2045
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2045.txt

RFC 2046
   http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2046.txt




----
MURATA Makoto  muraw3c@xxxxxxxxxxxxx