From: Erland Sommarskog (sommar-usefor@algonet.se)
Date: Sat Jul 14 2001 - 18:31:37 CDT
Dirk Nimmich <nimmich@uni-muenster.de> writes:
> While I too think that most newsreaders get the charset declaration
> accidentally right: Why shouldn't the author of the response
> determine and declare the correct charset for _his_ message?
But then that should be an active edit on his part. And it really
raises the question of why he should change the subject line only to
fix the charset.
> I disagree, however, with the opinion that different (MIME)
> encodings or different folding of a subject constitutes a "subject
> change": A transfer encoding does not change the content, only its
> representation on the wire.
If you change from iso-8859-1 to iso-8859-2 and all characters in
the subject line are present in both charsets, you are not changing
anything.
But if you have an 8-bit string whose charset you don't know,
you are obviously changing it if you encode it into something.
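
To make the distinction concrete, here is a minimal Python sketch. The
function name same_in_both is made up for illustration, and the check
is deliberately strict: it only reports "nothing changes" when the text
encodes to identical bytes in both charsets.

```python
def same_in_both(text, charset_a="iso-8859-1", charset_b="iso-8859-2"):
    """Return True if text encodes to identical bytes in both charsets."""
    try:
        return text.encode(charset_a) == text.encode(charset_b)
    except UnicodeEncodeError:
        # At least one character is missing from one of the charsets.
        return False


# An 8-bit octet of unknown charset: the guess decides which character it is.
raw = b"\xf5"
as_latin1 = raw.decode("iso-8859-1")  # o with tilde
as_latin2 = raw.decode("iso-8859-2")  # o with double acute

# Re-encoding under either guess produces different UTF-8 octets, so the
# subject really is changed by whichever guess the agent makes.
guesses_differ = as_latin1.encode("utf-8") != as_latin2.encode("utf-8")
```

In the first case relabelling is harmless; in the second, any
re-encoding bakes a guess into the header.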
> I don't see a reason for this. Subject threading based on a byte by
> byte comparison of the on-the-wire representation hasn't worked for
> a long time now and probably won't ever work in the future again.
Nevertheless many newsreaders do look at the subject. And this is
essential for a good newsreader. You cannot achieve sensible threading
on References alone. Two articles may have the same three-month-old
ancestor, yet appear in completely disjoint subthreads. A newsreader
that puts them in the same thread is severely crippled, in my opinion.
A good newsreader should, in my opinion, put articles with different
subjects in the same thread only if they have a common ancestor
loaded (= "unread" in most cases).
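
One possible reading of that rule, as a Python sketch. The Article
class, same_thread and loaded_ids are hypothetical names, not taken
from any actual newsreader, and treating "same subject plus any shared
ancestor" as sufficient is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    msg_id: str
    subject: str
    # Message-IDs from the References header, oldest first.
    references: list = field(default_factory=list)


def same_thread(a, b, loaded_ids):
    """Decide whether two articles should be shown in the same thread."""
    line_a = set(a.references) | {a.msg_id}
    line_b = set(b.references) | {b.msg_id}
    common = line_a & line_b
    if not common:
        return False
    if a.subject == b.subject:
        # Assumption: an identical subject plus shared ancestry is enough.
        return True
    # Different subjects: require a common ancestor that is actually loaded.
    return any(msg_id in loaded_ids for msg_id in common)


a = Article("<a@example>", "Charsets", ["<root@example>"])
b = Article("<b@example>", "Threading", ["<root@example>"])
old_root_gone = same_thread(a, b, loaded_ids={"<a@example>", "<b@example>"})
root_loaded = same_thread(a, b, loaded_ids={"<root@example>"})
```

The subject comparison here is an exact string match, which is exactly
why a gratuitous change of encoding in the subject line matters.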
Then of course newsreader authors could explore the field of fuzzy
comparisons to handle differences in spacing etc., but that's another
story.
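
For what such a fuzzy comparison might look like, a minimal Python
sketch follows. The function subject_key is hypothetical; collapsing
whitespace comes from the point above, while stripping a leading "Re:"
and ignoring case are extra assumptions of this sketch.

```python
import re


def subject_key(subject):
    """Reduce a subject line to a key for fuzzy comparison."""
    # Collapse folding and runs of spaces or tabs into single spaces.
    key = re.sub(r"\s+", " ", subject).strip()
    # Assumption: drop any leading "Re:" prefixes before comparing.
    key = re.sub(r"^(?:re:\s*)+", "", key, flags=re.IGNORECASE)
    # Assumption: compare case-insensitively.
    return key.casefold()


folded = "Re:  Charset   in\r\n subject"
plain = "charset in subject"
match = subject_key(folded) == subject_key(plain)
```

Note that this does nothing about differing encodings of the same
text; it only smooths over spacing and folding.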
After all, we are documenting current practice, and current practice
shows that if you change the subject line to change the encoding, you
are increasing the risk of causing problems.
> > Likewise, if the followup-agent can conclude that the
> > subject line is not in UTF-8 even though it contains 8bit
> > characters, the followup agent should not make any attempt to guess
> > the character set and correct it to UTF-8.
>
> It can display a generic "unknown" character (like a question mark
> or a box;
How the UTF-8 aware reader displays the non-UTF-8 8-bit subject line is
another matter, but it is hardly rocket science. We have a lot of
pure 8-bit text floating around today, and problems are scarce.
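
A minimal Python sketch of keeping display and followup separate. The
function render_subject is a made-up name, and using U+FFFD as the
"unknown" character is just one possible choice for this sketch.

```python
def render_subject(raw):
    """Return (text to display, octets to reuse in a followup)."""
    try:
        shown = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: do not guess a charset. Show each 8-bit octet as
        # U+FFFD and leave the original octets alone.
        shown = raw.decode("ascii", errors="replace")
    # The followup always gets the subject octets unchanged.
    return shown, raw


shown, followup = render_subject(b"Sm\xf6rg\xe5s")
```

Whatever the reader chooses to display, the followup carries the
original octets, so the subject line itself is not changed.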
> This does not only apply for characters in headers; the same
> problem exists for characters in the body. Would you also demand to
> post with the same Content-Type declaration as the original
> posting, not to speak of Content-Transfer-Encoding? This would not
> make much sense, if you ask me.
And thus I am not asking for it.
-- Erland Sommarskog, Stockholm, sommar@algonet.se