From: Clive D.W. Feather (clive@demon.net)
Date: Fri Mar 01 2002 - 11:18:07 CST
Charles Lindsey said:
>> I decided to test whether this syntax is actually unambiguous. So I
>> converted it all into yacc syntax mostly mechanically, then looked to see
>> what happened when I gave it to yacc.
> Thanks. That looks like a useful piece of work.
You're welcome. It was painful enough, however, that I'm not going to do it
again until we think we have the syntax finalised.
>> Firstly, there is definitely a problem with the white space that I haven't
>> been able to track down yet, and don't really have the time to do so. But I
>> can say that it's to do with [CFWS] and CFWS being adjacent.
> Academics in Ivory Towers will gladly prove that the grammar could have
> been written, in a finite number of rules, without that ambiguity. That is
> indeed true, but what they omit to tell you is that "finite" means 5 times
> the present number of rules, and that the result would be totally
> human-incomprehensible.
That doesn't surprise me.
> I can't say I particularly *like* the RFC 2822 folding syntax, but it is
> what we have got.
Hmm.
This isn't an area I've kept up with, but is the present state of the
*practical* art significantly better than LALR(1) ? [Yes, I know that LR(n)
can be done in theory, but I don't know of any usable parser-generators
that do better than LALR(1).] Assuming it isn't, it is surely a disservice
to require terrible kludges if they can be avoided easily.
Under other circumstances I would have said that you can just make CFWS
handling be part of the lexical analyser. But we have a grammar that
sometimes requires explicit tabs or explicit spaces, and sometimes requires
exactly one of them. That makes it much harder to do at tokenisation time.
> The only problem that arises from the ambiguity is that
> the syntax allows, in some places, for a folded header to include a line
> with nothing but whitespace in it (which is obviously a Bad Thing).
Ah. I hadn't even looked into that.
>> Secondly, yacc seems to be unable to cope with the idea that CRLF might be
>> part of folding white space or might be the end of a real line. For now
>> I've defined FWS as simply 1*WSP.
> Agreed. It is another example of the same problem, and my wording above
> now includes a fix for it.
Again, it would be a lexical issue were it not for the fact that sometimes we
insist on spaces or tabs (e.g. in "batch-header").
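[A minimal sketch of the lexical approach, not taken from the draft. It assumes a hand-written tokeniser, with hypothetical token names FOLD, EOL, SP, HTAB and TEXT, that uses one character of lookahead to decide whether a CRLF belongs to folding white space or ends a real line. SP and HTAB stay separate tokens, since rules such as "batch-header" still need to see them; that is the part a conventional lexer would otherwise throw away.]

```python
def tokenise(text):
    """Split header text into (kind, value) tokens.

    A CRLF followed by SP or HTAB is classified as FOLD, any other CRLF
    as EOL. The token names are illustrative, not part of any RFC.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            following = text[i + 2:i + 3]  # empty string at end of input
            kind = "FOLD" if following in (" ", "\t") else "EOL"
            tokens.append((kind, "\r\n"))
            i += 2
        elif text[i] == " ":
            tokens.append(("SP", " "))
            i += 1
        elif text[i] == "\t":
            tokens.append(("HTAB", "\t"))
            i += 1
        else:
            # Consume a run of anything that is not SP, HTAB or CRLF.
            j = i
            while (j < len(text) and text[j] not in " \t"
                   and not text.startswith("\r\n", j)):
                j += 1
            tokens.append(("TEXT", text[i:j]))
            i = j
    return tokens


print(tokenise("a\r\n b\r\n"))
```

With the CRLF already classified, the grammar proper no longer has to choose between the two readings of CRLF, which is the conflict yacc reported.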
>> Rewritten as:
>> CFWS = (FWS / comment [FWS]) *(comment [FWS])
>
> I presume that was just to keep yacc happy.
Yes. It was the usual issue of left- versus right-recursion.
>>> 2 dot-atom = [CFWS] dot-atom-text [CFWS]
>>> 2 dot-atom-text = 1*atext *( "." 1*atext )
>> An atom is a specific case of a dot-atom. This causes a problem at one
>> place: mailbox can begin with dot-atom (as part of an addr-spec) or with
>> atom (as part of the display-name form of a name-addr).
>
> Again, I think this is just because our grammar is not LALR (or whatever
> it is that yacc accepts). Regarded as a CFG, the grammar is not ambiguous
> here, and a parser with sufficient lookahead or backtracking would have
> coped.
But just how much lookahead or backtracking is needed ? Is it practical ?
Yes, I know this is an imported rule. But I can't right now see the logic
behind the distinction.
> Oops! That's because the syntax of no-fold-literal is broken (see below).
Okay.
>>> 2 phrase = 1*word
>> The string "ab" can be parsed as either one or two atoms, leading to an
>> ambiguity. There is also a problem with attaching CFWS to the preceding or
>> following word. As a lash-up, I did:
>
>> phrase = [CFWS] 1*( phrase-item [CFWS] )
>> phrase-item = (atext / DQUOTE *( [FWS] qcontent ) [FWS] DQUOTE )
>
> Yes, but we can't change that because that syntax is taken from RFC 2822.
> No harm arises (here or in RFC 2822),
Is there no significance to the sequence of words ?
> and the CFWS part of it is just an
> example of the usual problem, as discussed above.
Understood.
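[To make the "ab" ambiguity concrete, here is a small sketch that enumerates every way a run of atext characters can be read as 1*word. It assumes every character is atext and that all the optional CFWS is empty; the function name is made up for illustration.]

```python
def phrase_parses(text):
    """Return every split of text into one or more non-empty atoms."""
    if not text:
        return []
    parses = [[text]]  # the whole run read as a single atom
    for cut in range(1, len(text)):
        head = text[:cut]
        for rest in phrase_parses(text[cut:]):
            parses.append([head] + rest)
    return parses


for parse in phrase_parses("ab"):
    print(parse)
```

A run of n characters has 2^(n-1) such readings, so "ab" has two and "abc" has four. All of them cover the same text, which is presumably why no harm arises in practice.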
>>> 2 specials = "(" / ")" / ; Special characters used in
>>> "<" / ">" / ; other parts of the syntax
>>> "[" / "]" /
>>> ":" / ";" /
>>> "@" / "\" /
>>> "," / "." /
>>> DQUOTE
>> This definition is never used except in a comment.
> A convenience, as explained in RFC 2822.
But do we need it in the grammar ?
>> A display-name is a phrase, which is a sequence of words and so can end
>> with CFWS. So remove the leading [CFWS] and put it into the first option of
>> name-addr instead.
> Another example of the usual problem. We are stuck with the way RFC 2822
> does it.
Sigh.
>>> 5* iana-token = <A token defined in an experimental
>>> or standards-track RFC and registered
>>> with IANA>
>
>> For consistency with x-token this should read:
>
>> iana-token = [CFWS] token-core [CFWS]
>> ; the token-core must be one defined in an experimental
>> or standards-track RFC and registered with IANA
>
> No. It is taken from RFC 2045, so better to leave it that way.
But there's a problem. Does iana-token include the surrounding CFWS ? In
other words, does it have the syntax of "token" ? After all, the token in
whatever RFC applies won't include the white space.
>>> 5* token-core = 1*<any (US-ASCII) CHAR except SP, CTLs,
>>> or tspecials>
>> Can be rewritten as a specific definition:
>
>> token-core = 1*("!" / %d35-39 / "*" / "+" / "-" / "." /
>> DIGIT / ALPHA / %d94-96 / %d123-126 )
>
> Again, it is better to stay with RFC 2045. Also with RFC 2616 which also
> defines 'tspecials', but differently :-( .
I completely disagree. You've already modified the rule, so you're not
sticking with another RFC. All the other sets of characters are defined as
explicit lists. And it's far clearer for the reader. Oh, and I had to guess
as to whether %d127 was allowed or not; at least my rewrite makes that
clear.
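[The explicit list can be checked mechanically against the prose definition. The sketch below assumes the tspecials of RFC 2045 and the RFC 822 reading of CTL as 0-31 plus 127; if the draft's modified rule uses a different set, the comparison would need adjusting.]

```python
# tspecials as listed in RFC 2045 (assumed to be the set the draft intends).
TSPECIALS = set('()<>@,;:\\"/[]?=')


def derived_token_chars():
    """Any US-ASCII CHAR except SP, CTLs, or tspecials."""
    chars = set()
    for code in range(1, 128):
        ch = chr(code)
        is_ctl = code < 32 or code == 127  # assumption: DEL counts as a CTL
        if ch != " " and not is_ctl and ch not in TSPECIALS:
            chars.add(ch)
    return chars


def explicit_token_chars():
    """The characters named by the explicit token-core rewrite."""
    codes = [ord("!")] + list(range(35, 40)) + [ord(c) for c in "*+-."]
    codes += list(range(48, 58))                          # DIGIT
    codes += list(range(65, 91)) + list(range(97, 123))   # ALPHA
    codes += list(range(94, 97)) + list(range(123, 127))  # %d94-96, %d123-126
    return {chr(c) for c in codes}


print(derived_token_chars() == explicit_token_chars())
```

Under those assumptions the two sets agree, and %d127 is excluded from both.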
>>> Appendix B.3.2 - Template instantiations
>>> Followup-To-content = Newsgroups-content / [FWS] "poster" [FWS]
>> [[special case]]
> Covered by verbiage.
[and elsewhere] Sure, but I thought it worth pointing out where the grammar
had an issue.
>>> Mail-Copies-To-content
>>> = copy-addr / [CFWS] ( "nobody" / "poster" ) [CFWS]
>> [[special case]]
> I think not. A copy-addr MUST have an '@' somewhere inside it.
Maybe, but again it's not easy to parse.
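[A rough sketch of why the '@' test is awkward for a one-token-lookahead parser: the decision can only be made after scanning the whole content. It deliberately ignores comments and quoted strings, strips plain white space only, and uses a made-up function name.]

```python
KEYWORDS = ("nobody", "poster")


def classify_mail_copies_to(content):
    """Classify a Mail-Copies-To value by inspecting all of it at once."""
    stripped = content.strip(" \t")
    if stripped.lower() in KEYWORDS:  # ABNF literals are case-insensitive
        return "keyword"
    if "@" in stripped:               # unbounded scan, not one token ahead
        return "copy-addr"
    return "invalid"


print(classify_mail_copies_to(" poster "))
print(classify_mail_copies_to("poster <clive@demon.net>"))
```

The second call begins with the same atom as the first, so a parser that has seen only "poster" cannot yet tell which alternative it is in.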
>>> Path-content = [FWS] *( path-identity [FWS] path-delimiter [FWS] )
>>> tail-entry [FWS]
>> I can't get yacc to accept this as written. If the trailing [FWS] is made
>> part of tail-entry instead, this resolves it.
> I think this is just another yacc LALR problem.
Very probably.
> It is better left as it is
> to make agreement with the "[FWS] at each end" invariant easier to verify.
Ho hum.
>>> 2 msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
>> A msg-id can occur immediately before or after CFWS (in one case - IHAVE -
>> it is required to be followed by a single SP). Therefore the leading
>> and trailing [CFWS] need to be removed and placed in the relevant rules, if
>> any.
> No, I would rather fix the problem by allowing CFWS where it says SP in
> IHAVE.
Fine by me. Indeed, I'd like to know if there's *any* place we have SP or
HTAB where we couldn't change it to FWS or CFWS.
>> and the second part of newsgroups-line to:
>
>> [ HTAB *WSP newsgroup-description ]
>
> But not that one. My belief is that current usage is for all the WSP
> between the newsgroup-name and the newsgroup-description to consist of
> HTABs. Comments anyone?
So would we make HTAB SP be illegal ?
>>> newsgroups-tag = %x46.6F.72 SP %x79.6F.75.72 SP
>> [[special case]]
>
> I think not, because that string contains no HTAB.
But LALR can't tell them apart. So it's [[special case]] in that sense.
> But, in any case, you can't have a single-component newsgroup-name.
That's not what the grammar says. [And why not ?]
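[For readers who do not read ABNF numeric values fluently, this sketch decodes the newsgroups-tag rule quoted above into the text it stands for. It handles only "%x" hex values joined by dots and the core rule SP, which is all that rule uses; the function name is illustrative.]

```python
def decode_abnf(elements):
    """Decode a space-separated list of %x values and SP into text."""
    out = []
    for element in elements.split():
        if element == "SP":
            out.append(" ")
        elif element.startswith("%x"):
            # Dots concatenate single values: %x46.6F.72 is three octets.
            out.append("".join(chr(int(h, 16)) for h in element[2:].split(".")))
        else:
            raise ValueError("unsupported element: " + element)
    return "".join(out)


print(repr(decode_abnf("%x46.6F.72 SP %x79.6F.75.72 SP")))
```

The tag decodes to "For your " with a trailing space, and the first thing a tokeniser sees is the atom "For", which is what a one-component newsgroup-name would look like too.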
> These should have been:
>
> 2* no-fold-literal = "[" *( dtext / strict-quoted-pair ) "]"
> 2* no-fold-quote = DQUOTE *( strict-qtext / strict-quoted-pair ) DQUOTE
Which fixes all the related issues.
>>> posting-sender-parameter
>>> = [CFWS] Posting-Sender-token [CFWS] "=" [CFWS]
>>> ( sender-value /
>>> DQUOTE sender-value DQUOTE ) [CFWS]
>> Secondly, and rather more worryingly, a
>> sender-value can begin with a quoted-string, meaning that there's a parsing
>> problem with a DQUOTEd sender-value.
> But the second is REAL NASTY :-( .
[...]
See separate thread.
>>> sender-value = ( mailbox / "verified" )
>> [[special case]]
> I think not (no `@` in "verified" again).
Again, too much lookahead needed.
--
Clive D.W. Feather | Work: <clive@demon.net> | Tel: +44 20 8371 1138
Internet Expert | Home: <clive@davros.org> | Fax: +44 870 051 9937
Demon Internet | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc | | NOTE: fax number change