From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Mon Mar 04 2002 - 07:25:26 CST
In <20020301171807.GX48327@finch-staff-1.demon.net> "Clive D.W. Feather" <clive@demon.net> writes:
>Charles Lindsey said:
>> I can't say I particularly *like* the RFC 2822 folding syntax, but it is
>> what we have got.
>Hmm.
>This isn't an area I've kept up with, but is the present state of the
>*practical* art significantly better than LALR(1) ? [Yes, I know that LR(n)
>can be done in theory, but I don't know of any usable parser-generators
>that do better than LALR(1).] Assuming it isn't, it is surely a disservice
>to require terrible kludges if they can be avoided easily.
Indeed, but these particular kludges cannot be avoided easily (that was my
point about the "Ivory Tower" solutions). Of course, it would be a doddle
in a W-Grammar :-) .
>Under other circumstances I would have said that you can just make CFWS
>handling be part of the lexical analyser. But we have a grammar that
>sometimes requires explicit tabs or explicit spaces, and sometimes requires
>exactly one of them. That makes it much harder to do at tokenisation time.
Yes, that was essentially the way RFC 822 did it AIUI, and it is probably
the way implementations do it. But deciding, in a formal specification,
where to draw the line between lexical and syntactic issues is always a
hard decision.
>> The only problem that arises from the ambiguity is that
>> the syntax allows, in some places, for a folded header to include a line
>> with nothing but whitespace in it (which is obviously a Bad Thing).
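To make the Bad Thing concrete, here is a minimal Python sketch that flags continuation lines consisting solely of SP and HTAB. The function name is hypothetical, and the sketch assumes the header arrives as a single string with CRLF line endings; it is not taken from any existing implementation.

```python
def whitespace_only_continuations(header):
    """Return indexes of continuation lines that contain only SP/HTAB."""
    lines = header.split("\r\n")
    bad = []
    for i, line in enumerate(lines[1:], start=1):
        # Skip the empty string left by a trailing CRLF; flag a line that
        # is non-empty but has nothing except spaces and tabs in it.
        if line and line.strip(" \t") == "":
            bad.append(i)
    return bad


folded = "Subject: hello\r\n \r\n world"
print(whitespace_only_continuations(folded))  # [1]
```

A checker of this kind sits outside the grammar proper, which is one way of living with a syntax that permits such lines.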
>>>> 2 dot-atom = [CFWS] dot-atom-text [CFWS]
>>>> 2 dot-atom-text = 1*atext *( "." 1*atext )
>>> An atom is a specific case of a dot-atom. This causes a problem at one
>>> place: mailbox can begin with dot-atom (as part of an addr-spec) or with
>>> atom (as part of the display-name form of a name-addr).
>>
>> Again, I think this is just because our grammar is not LALR (or whatever
>> it is that yacc accepts). Regarded as a CFG, the grammar is not ambiguous
>> here, and a parser with sufficient lookahead or backtracking would have
>> coped.
>But just how much lookahead or backtracking is needed ? Is it practical ?
Unbounded, I think (because a phrase is of unbounded length). But no
problem for a recursive descent parser like Wirth used to write.
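As a rough illustration, the following Python sketch tries name-addr first and, if that fails, rewinds to the start and tries addr-spec. It is a deliberately cut-down model: CFWS, quoted-strings, domain-literals and the optional display-name rules are ignored or simplified, and all the function names are hypothetical.

```python
import re

# Simplified tokens: a run of atext, or one of the specials "<", ">", "@", ".".
TOKEN = re.compile(r"\s*([A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+|[<>@.])")
SPECIAL = {"<", ">", "@", "."}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_dot_atom(toks, i):
    """atom *( "." atom ); returns the next index, or None on failure."""
    if i >= len(toks) or toks[i] in SPECIAL:
        return None
    i += 1
    while i + 1 < len(toks) and toks[i] == "." and toks[i + 1] not in SPECIAL:
        i += 2
    return i


def parse_addr_spec(toks, i):
    i = parse_dot_atom(toks, i)
    if i is None or i >= len(toks) or toks[i] != "@":
        return None
    return parse_dot_atom(toks, i + 1)


def parse_name_addr(toks, i):
    # The display-name is a phrase of unbounded length, so only the
    # arrival of "<" confirms that this alternative was the right one.
    while i < len(toks) and toks[i] not in SPECIAL:
        i += 1
    if i >= len(toks) or toks[i] != "<":
        return None
    i = parse_addr_spec(toks, i + 1)
    if i is None or i >= len(toks) or toks[i] != ">":
        return None
    return i + 1


def parse_mailbox(text):
    toks = tokenize(text)
    # Try each alternative from the start; a failure simply rewinds to 0.
    for name, rule in (("name-addr", parse_name_addr),
                       ("addr-spec", parse_addr_spec)):
        if rule(toks, 0) == len(toks):
            return name
    return None


print(parse_mailbox("Charles Lindsey <chl@clw.cs.man.ac.uk>"))  # name-addr
print(parse_mailbox("chl@clw.cs.man.ac.uk"))                    # addr-spec
```

The rewind costs nothing here because the whole token list is held in memory; the difficulty only arises for a generator restricted to a fixed lookahead.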
>>>> 2 phrase = 1*word
>>> The string "ab" can be parsed as either one or two atoms, leading to an
>>> ambiguity. There is also a problem with attaching CFWS to the preceding or
>>> following word. As a lash-up, I did:
>>
>>> phrase = [CFWS] 1*( phrase-item [CFWS] )
>>> phrase-item = (atext / DQUOTE *( [FWS] qcontent ) [FWS] DQUOTE )
>>
>> Yes, but we can't change that because that syntax is taken from RFC 2822.
>> No harm arises (here or in RFC 2822),
>Is there no significance to the sequence of words ?
I don't think so. Phrases never have any semantic significance. The grammar
stinks, but it is not ours to change.
>>>> 2 specials = "(" / ")" / ; Special characters used in
>>>> "<" / ">" / ; other parts of the syntax
>>>> "[" / "]" /
>>>> ":" / ";" /
>>>> "@" / "\" /
>>>> "," / "." /
>>>> DQUOTE
>>> This definition is never used except in a comment.
>> A convenience, as explained in RFC 2822.
>But do we need it in the grammar ?
It is useful to talk about. See the long explanation in RFC 2822 as to why
it was included.
>>>> 5* iana-token = <A token defined in an experimental
>>>> or standards-track RFC and registered with
>>>> IANA>
>>
>>> For consistency with x-token this should read:
>>
>> > iana-token = [CFWS] token-core [CFWS]
>> > ; the token-core must be one defined in an experimental
>> > or standards-track RFC and registered with IANA
>>
>> No. It is taken from RFC 2045, so better to leave it that way.
>But there's a problem. Does iana-token include the surrounding CFWS ? In
>other words, does it have the syntax of "token" ? After all, the token in
>whatever RFC applies won't include the white space.
Yes, there are some loose ends I need to tidy up in that area. I am
discussing some of this on the ietf-822 list, where Ned Freed (who is a
prolific author of MIME-style RFCs, but not the world's most pedantic
writer) is making waves. Since he is our Area Director and sits on IESG,
we have to be careful not to upset him overmuch :-( .
>>>> 5* token-core = 1*<any (US-ASCII) CHAR except SP, CTLs,
>>>> or tspecials>
>>> Can be rewritten as a specific definition:
>>
>>> token-core = 1*("!" / %d35-39 / "*" / "+" / "-" / "." /
>>> DIGIT / ALPHA / %d94-96 / %d123-126 )
>>
>> Again, it is better to stay with RFC 2045. Also with RFC 2616 which also
>> defines 'tspecials', but differently :-( .
>I completely disagree. You've already modified the rule, so you're not
>sticking with another RFC. All the other sets of characters are defined as
>explicit lists. And it's far clearer for the reader. Oh, and I had to guess
>as to whether %d127 was allowed or not; at least my rewrite makes that
>clear.
No, we have to stay within the spirit of RFC 2045, whilst trying to second
guess how it would have been written post-2822. And the concept of
tspecials is useful, especially as it is also used in RFC 2616.
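The two formulations can at least be checked against each other mechanically. The Python sketch below assumes tspecials exactly as listed in RFC 2045, that CHAR means octets 0 to 127, and that CTL covers 0 to 31 plus DEL (127); the draft's own tspecials is not quoted in this message, so if it differs the comparison would need adjusting. The function names are hypothetical.

```python
# Assumption: tspecials as given in RFC 2045.
TSPECIALS = set('()<>@,;:\\"/[]?=')


def from_prose_rule():
    """Any US-ASCII CHAR except SP, CTLs, or tspecials."""
    return {c for c in range(128)
            if c != 32                      # SP
            and not (c < 32 or c == 127)    # CTLs, taking DEL as a CTL
            and chr(c) not in TSPECIALS}


def from_explicit_list():
    """The explicit rewrite quoted above."""
    codes = {ord(ch) for ch in "!*+-."}
    for lo, hi in ((35, 39), (94, 96), (123, 126),
                   (48, 57),               # DIGIT
                   (65, 90), (97, 122)):   # ALPHA
        codes.update(range(lo, hi + 1))
    return codes


prose, explicit = from_prose_rule(), from_explicit_list()
print(prose == explicit)  # True
print(127 in prose)       # False
```

Under those assumptions the two definitions describe the same set of characters, and %d127 is excluded by both.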
>>>> 2 msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
>>> A msg-id can occur immediately before or after CFWS (in one case - IHAVE -
>>> it is required to be followed by a single SP). Therefore the leading
>>> and trailing [CFWS] need to be removed and placed in the relevant rules, if
>>> any.
>> No, I would rather fix the problem by allowing CFWS where it says SP in
>> IHAVE.
>Fine by me. Indeed, I'd like to know if there's *any* place we have SP or
>HTAB where we couldn't change it to FWS or CFWS.
Actually, on looking at Son-of-1036 more carefully, I see that multiple
whitespace WAS permitted here (and in References), so it is a non-problem.
Currently, the only place an explicit SP is required is following the
":" in a header. That is there because much existing software breaks
without it.
>>> and the second part of newsgroups-line to:
>>
>>> [ HTAB *WSP newsgroup-description ]
>>
>> But not that one. My belief is that current usage is for all the WSP
>> between the newsgroup-name and the newsgroup-description to consist of
>> HTABs. Comments anyone?
>So would we make HTAB SP be illegal ?
It would. My understanding is that that is accepted current practice. That
is not to say that it does not happen, but there are no examples within
the "respectable" hierarchies according to my newsgroups file.
But there is a large newsgroups file not a mile from where you sit.
Perhaps it would be useful to grep for counterexamples within it.
And we could do with input from other people on this.
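Something along the following lines would do for the search. It is a minimal Python sketch that reports any line whose separator between the newsgroup-name and the description is not made up entirely of HTABs; the function name and the sample lines are invented for illustration and are not taken from a real newsgroups file.

```python
import re

# Name, then the whitespace run that separates it from the description.
LINE = re.compile(r"^(\S+)([ \t]+)(\S.*)$")


def counterexamples(lines):
    """Yield lines whose name/description separator is not all HTABs."""
    for line in lines:
        stripped = line.rstrip("\r\n")
        m = LINE.match(stripped)
        if m and set(m.group(2)) != {"\t"}:
            yield stripped


# Made-up sample data.
sample = [
    "comp.lang.c\t\tDiscussion about C.",
    "news.groups\t Discussions and lists of newsgroups.",
    "misc.test Testing.",
]
for hit in counterexamples(sample):
    print(repr(hit))
```

To run it over a real file, pass the open file object to counterexamples() in place of the sample list.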
>>>> newsgroups-tag = %x46.6F.72 SP %x79.6F.75.72 SP
>>> [[special case]]
>>
>> I think not, because that string contains no HTAB.
>But LALR can't tell them apart. So it's [[special case]] in that sense.
>> But, in any case, you can't have a single-component newsgroup-name.
>That's not what the grammar says. [And why not ?]
It's wot it says in 5.5.1.
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5