Re: Path field differences: part two of two

From: Bruce Lilly (blilly@erols.com)
Date: Sat Jul 03 2004 - 12:50:20 CDT


Russ Allbery wrote:

> when representing the header in various situations
> where the field name and field body are separate, you cannot reliably
> reproduce the original article.

I don't follow that last bit; could you please clarify?

> Here's the text from the NNTP draft, which I think is fairly precise:
>
> The headers of an article consist of one or more header lines.

I strongly disagree about precision, right from the start. "Header" has
a precise well-defined meaning dating back to at least RFC 561, and it
does not mean the same thing as "header field" (RFC 724 and later). A
message may indeed contain multiple headers, e.g. if it contains a MIME
composite media type. Likewise "lines" is about as imprecise as it's
possible to be.

> Each
> header line consists of a header name, a colon, a space, the header
> content, and a CRLF in that order. The name consists of one or more
> printable US-ASCII characters other than colon and, for the purposes of
> this specification, is not case-sensitive. There MAY be more than one
> header line with the same name. The content MUST NOT contain CRLF; it

Please note carefully that "content MUST NOT contain CRLF". With the
earlier remark about a "line", that means that folding is verboten.

> MAY be empty. A header may be "folded"; that is, a CRLF pair may be
> placed before any TAB or space in the line;

Let me get this straight: there's a name, a colon, a space, "content", and
a CRLF in that order. "content" "MUST NOT contain CRLF". But "a CRLF [...]
may be placed before any TAB or space in the line". Ow, that hurts!
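
To make the clash concrete, here is a minimal Python sketch. It assumes that a CRLF immediately followed by SP or TAB is a fold point (the RFC 2822 notion of unfolding); the helper names and the sample Path field are hypothetical. As transmitted, the "content" does contain CRLF; only after unfolding does the MUST NOT hold.

```python
import re

# Assumption: a CRLF immediately followed by SP or TAB is a fold point.
_FOLD = re.compile(rb"\r\n(?=[ \t])")

def unfold(field):
    """Drop folding CRLFs; the terminating CRLF is not followed by WSP, so it stays."""
    return _FOLD.sub(b"", field)

def content_of(field):
    """Return what follows "name: " up to, but excluding, the final CRLF."""
    _, _, rest = field.partition(b": ")
    return rest[:-2] if rest.endswith(b"\r\n") else rest

# Hypothetical folded field, as it would appear on the wire.
wire = b"Path: a.example!b.example\r\n\t!c.example\r\n"

print(b"\r\n" in content_of(wire))          # content as transmitted
print(b"\r\n" in content_of(unfold(wire)))  # content after unfolding
```

The first check reports that the transmitted content contains a CRLF, the second that the unfolded content does not, so the prose is only self-consistent if "content" silently means "unfolded content".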

> there MUST still be some
> other octet between any two CRLF pairs in a header line. (Note that
> folding means that the header line occupies more than one line when

Right, one "line" occupies more than one "line". Okaaay, moving right along...
So CRLF SP CRLF SP CRLF is OK, right? See RFC 2822.
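
The two rules quoted from the draft can be written as checks, which shows where CRLF SP CRLF SP CRLF lands. This is only a sketch of the quoted wording: the function names and the sample field are made up for illustration, and nothing here comes from the draft itself.

```python
import re

def satisfies_must(field):
    # MUST: some other octet between any two CRLF pairs,
    # i.e. no two CRLF pairs directly adjacent.
    return b"\r\n\r\n" not in field

def satisfies_should(field):
    # SHOULD: at least one octet other than %x09 or %x20 between CRLF pairs,
    # i.e. no two CRLF pairs separated by nothing but TAB/SP.
    return re.search(rb"\r\n[ \t]*\r\n", field) is None

# Hypothetical field ending in CRLF SP CRLF SP CRLF.
odd = b"X-Test: a\r\n \r\n \r\n"
print(satisfies_must(odd), satisfies_should(odd))
```

Under this reading the sequence passes the MUST and trips only the SHOULD, so a conforming implementation may still emit it.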

> displayed or transmitted; nevertheless it is still referred to as "a"
> header line.) The presence or absence of folding does not affect the
> meaning of the header line; that is, the CRLF pairs introduced by
> folding are not considered part of the header content. Header lines
> SHOULD NOT be folded before the space after the colon that follows the
> header name, and SHOULD include at least one octet other than %x09 or
> %x20 between CRLF pairs.

Ooh, now we're (or rather, they're) mixing TAB and "space" with %x09 and %x20.
Which charset? Is charset even defined in that draft?

> However, if an article has been received from
> elsewhere with one of these, clients and servers MAY transfer it to the
> other without re-folding it.
>
> and in ABNF:
>
> header = header-name ":" [CRLF] SP header-content CRLF
> header-name = 1*A-NOTCOLON
> header-content = *(S-CHAR / [CRLF] WS)

Remember that "content MUST NOT contain CRLF"? WTF!?!
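
The grammar can be transcribed into a regular expression to check this mechanically. The sketch below is my own transcription, not anything from the draft. It assumes WS means SP / TAB (the excerpt does not define it) and uses the permissive S-CHAR = %x21-FF that the draft gives for what software must cope with. The sample field is hypothetical.

```python
import re

# Transcription of the quoted ABNF. Assumptions: WS = SP / TAB,
# and S-CHAR = %x21-FF (the "must cope with" form).
HEADER = re.compile(
    rb"(?P<name>[\x21-\x39\x3B-\x7E]+)"                # header-name = 1*A-NOTCOLON
    rb":(?:\r\n)? "                                    # ":" [CRLF] SP
    rb"(?P<content>(?:[\x21-\xFF]|(?:\r\n)?[ \t])*)"   # header-content
    rb"\r\n"                                           # CRLF
)

# Hypothetical folded field.
m = HEADER.fullmatch(b"Subject: folded\r\n value\r\n")
print(m is not None)
print(b"\r\n" in m.group("content"))
```

The field matches, and the part matched by header-content includes the folding CRLF, which is exactly what the prose says the content MUST NOT contain.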

> A-NOTCOLON = %x21-39 / %x3B-7E ; exclude ":"

Sigh. Why not simply use the RFC [2]822 terms which adequately define what
is needed? Now one has to look very carefully for inconsistencies.

> where software must cope with:
>
> S-CHAR = %x21-FF
>
> but should only generate:
>
> S-CHAR = P-CHAR
> P-CHAR = A-CHAR / UTF8-non-ascii
> A-CHAR = %x21-7E
> UTF8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
> UTF8-2 = %xC2-DF UTF8-tail
> UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
> %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
> UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
> %xF4 %x80-8F 2UTF8-tail
> UTF8-tail = %x80-BF

That part may or may not be precise; I still have a headache from the
"content MUST NOT contain CRLF"/may contain CRLF bit. Precise or not,
it certainly isn't concise or clear.
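
Whether the UTF8-non-ascii rule is precise can at least be spot-checked. The sketch below transcribes the quoted rule into a Python regular expression and compares it against Python's own strict UTF-8 decoder on a handful of byte sequences. The helper name and the samples are mine, and a few samples are no proof of equivalence.

```python
import re

# Transcription of UTF8-2 / UTF8-3 / UTF8-4 from the quoted ABNF.
UTF8_NON_ASCII = re.compile(
    rb"(?:[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"
)

def agrees_with_decoder(seq):
    """True if the ABNF and Python's decoder agree on whether seq is
    exactly one non-ASCII character encoded in UTF-8."""
    matches = UTF8_NON_ASCII.fullmatch(seq) is not None
    try:
        decoded = seq.decode("utf-8")
        valid = len(decoded) == 1 and ord(decoded) > 0x7F
    except UnicodeDecodeError:
        valid = False
    return matches == valid

samples = [
    "\u00e9".encode("utf-8"),      # two octets
    "\u20ac".encode("utf-8"),      # three octets
    "\U0001F600".encode("utf-8"),  # four octets
    b"\xC0\x80",                   # overlong encoding
    b"\xED\xA0\x80",               # encoded surrogate
    b"\xF4\x90\x80\x80",           # beyond U+10FFFF
    b"A",                          # plain ASCII
]
print(all(agrees_with_decoder(s) for s in samples))
```

On these samples the rule and the decoder agree, which suggests this part of the grammar is at least not obviously wrong, whatever one thinks of its clarity.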

I hope it's clear now what I meant earlier by advocating use of "standard
terminology" and "self-consistency". It really does matter.

> One of the requirements for Usenet articles
> is that they can be safely conveyed via NNTP.

I think that's the tail wagging the dog; surely news articles appeared
before a Network News Transport Protocol came about -- one of the
requirements of such a protocol is that it be capable of transporting
news articles. It is unfortunate that we have had to even consider
going back to the article specification to add restrictions to cope
with poor transport protocol design that was specifically supposed to
be able to transport articles in the first place. It is even more of a
disgrace that we have then gone back to the group responsible for the
underlying message format, hat in hand, saying, "look, some bozos
implemented some software that crashes when handed message-ids of
moderate but legal length, therefore please impose a severe length limit
on everybody else in order to protect that broken software".

> There was previous discussion on this in the mailing list and the current
> wording was the result of recognizing that non-domain Path identities are
> in widespread use by people who have had the same Path identity ever since
> it was a registered UUCP name. I believe the current language was
> intended to basically say that you only get to use your non-DNS Path
> identity if you have it for legacy reasons and you basically can't create
> a new one, but I don't know if it successfully conveys that intention.

It does not. Moreover, it requires uniqueness but provides no mechanism
for an agent to ensure uniqueness, or even to check whether a name is
already taken. I don't even know if Mel Pleasant at Rutgers
still coordinates UUCP maps, and even if so, those maps were full of
duplicates last time I looked (I even made some scripts for finding such
duplicates available years ago).
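
For illustration, a duplicate check of this kind can be sketched in a few lines of Python. This is not one of the scripts mentioned above. It assumes map entries declare site names on lines beginning with "#N", possibly as a comma-separated list, and the file names and contents are invented.

```python
from collections import defaultdict

def find_duplicate_names(map_files):
    """map_files: dict of file name -> map text.
    Returns {site name: [files declaring it]} for names declared more than once."""
    seen = defaultdict(list)
    for fname, text in map_files.items():
        for line in text.splitlines():
            # Assumption: "#N" lines carry the site name(s), comma-separated.
            if line.startswith("#N"):
                for name in line[2:].replace(",", " ").split():
                    seen[name].append(fname)
    return {name: files for name, files in seen.items() if len(files) > 1}

# Hypothetical map fragments.
maps = {
    "u.usa.nj.1": "#N\tfoo\n#S\tsome machine\n",
    "u.usa.ny.1": "#N\tfoo, bar\n",
    "u.usa.ca.1": "#N\tbaz\n",
}
print(find_duplicate_names(maps))
```

A check like this only finds duplicates within the maps one happens to hold; it does nothing to establish uniqueness for names that were never registered.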



