From: Charles Lindsey (chl@clw.cs.man.ac.uk)
Date: Wed Feb 27 2002 - 13:01:20 CST
In <20020226110154.GA77146@finch-staff-1.demon.net> "Clive D.W. Feather" <clive@demon.net> writes:
>Charles Lindsey said:
>> The complete Collected Syntax as fixed is reproduced below.
>I decided to test whether this syntax is actually unambiguous. So I
>converted it all into yacc syntax mostly mechanically, then looked to see
>what happened when I gave it to yacc.
Thanks. That looks like a useful piece of work.
>Firstly, there is definitely a problem with the white space that I haven't
>been able to track down yet, and don't really have the time to do so. But I
>can say that it's to do with [CFWS] and CFWS being adjacent. If I replace
>all uses of [CFWS] with [DUMMY CFWS] there are no problems (modulo the
>other changes mentioned below). But if I remove DUMMY then I get a whole
>load of conflicts in the grammar.
Yes, that is exactly what I would have expected, and it all stems from RFC
2822.
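The adjacency problem can be seen on a very small scale. What follows is only a sketch, assuming Clive's simplification that FWS is just 1*WSP (no CRLF); the helper name count_derivations is hypothetical. It counts the distinct ways a run of whitespace can be derived from the sequence [FWS] FWS, which is the kind of adjacency that arises when [CFWS] sits next to CFWS.

```python
def count_derivations(run: str) -> int:
    """Count the ways `run` can be derived from the sequence  [FWS] FWS,
    assuming FWS = 1*WSP (the simplification used for the yacc test)."""
    if not run or any(ch not in " \t" for ch in run):
        return 0  # the mandatory FWS needs at least one WSP, and only WSP
    count = 0
    # The optional FWS may absorb 0 .. len(run)-1 characters; the
    # mandatory FWS takes the remainder, which must be non-empty.
    for taken_by_optional in range(len(run)):
        remainder = len(run) - taken_by_optional
        if remainder >= 1:
            count += 1
    return count


single = count_derivations(" ")    # only one way to split one WSP
triple = count_derivations("   ")  # several ways to split three WSP
```

Any run longer than one character has more than one derivation, which is what an LALR generator reports as a conflict even though every derivation means the same thing.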
Academics in Ivory Towers will gladly prove that the grammar could have
been written, in a finite number of rules, without that ambiguity. That is
indeed true, but what they omit to tell you is that "finite" means 5 times
the present number of rules, and that the result would be totally
human-incomprehensible.
I can't say I particularly *like* the RFC 2822 folding syntax, but it is
what we have got. The only problem that arises from the ambiguity is that
the syntax allows, in some places, for a folded header to include a line
with nothing but whitespace in it (which is obviously a Bad Thing).
So RFC 2822 has a MUST NOT (its section 3.2.3) to exclude that possibility
and we had essentially the same words in our section 4.2.3. However, I had
already rewritten that, and the present text is as follows:
FWS occurs at many places in the syntax (usually within a CFWS) in
order to allow the inclusion of comments, whitespace and folding. The
syntax is in fact ambiguous insofar as it sometimes allows two
consecutive instantiations of FWS (at least one of which is always
optional), or of FWS followed by an explicit CRLF. However, all such
cases MUST be treated as if the optional instantiation (or one of
them) had not been present. It is thus precluded that any line of a
header should be made up of whitespace characters and nothing else
(for such a line might otherwise have been interpreted by a non-
compliant agent as the separator between the headers and the body of
the article).
NOTE: This does not lead to semantic ambiguity because, unless
specifically stated otherwise, the presence or absence of a
comment or additional WSP has no semantic meaning and, in
particular, it is a matter of indifference whether it forms a
part of the syntactic construct preceding it or the one
following it.
[That is not as long as it actually seems, because it incorporates various
bits of text now removed from other places in the draft.]
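The practical consequence of that wording can be checked mechanically. This is a minimal sketch, not part of the draft: the function name has_whitespace_only_line is hypothetical, and it assumes the header is supplied as a single string with CRLF line endings.

```python
def has_whitespace_only_line(header: str) -> bool:
    """Return True if any line of a (possibly folded) header consists
    solely of SP/HTAB characters, or is empty."""
    lines = header.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the CRLF that terminates the header itself
    return any(line.strip(" \t") == "" for line in lines)


good = "Subject: hello\r\n world\r\n"
bad = "Subject: hello\r\n \t \r\n world\r\n"
good_result = has_whitespace_only_line(good)
bad_result = has_whitespace_only_line(bad)
```

A header such as `bad` is the case the MUST is aimed at: its middle line could be mistaken for the separator between the headers and the body.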
>Secondly, yacc seems to be unable to cope with the idea that CRLF might be
>part of folding white space or might be the end of a real line. For now
>I've defined FWS as simply 1*WSP.
Agreed. It is another example of the same problem, and my wording above
now includes a fix for it.
>The note [[special case]] means that yacc complained about a conflict, but
>it's because we have a special case string (e.g. "verified") conflicting
>with a more general case. I inserted a dummy token into the grammar to
>resolve these.
Yes, these are mostly covered by verbiage in the text (e.g. that "poster" is
not allowed as a newsgroup-name). But you did spot one I had not covered.
>> Appendix B - Collected Syntax
>>
>> Appendix B.1 - Characters, Atoms and Folding
>>
>> 2 CFWS = *([FWS] comment) (([FWS] comment) / FWS )
>Rewritten as:
> CFWS = (FWS / comment [FWS]) *(comment [FWS])
I presume that was just to keep yacc happy. I don't think there is
anything wrong with the present version (which is, in any case, taken
directly from RFC 2822).
>> 2 dot-atom = [CFWS] dot-atom-text [CFWS]
>> 2 dot-atom-text = 1*atext *( "." 1*atext )
>An atom is a specific case of a dot-atom. This causes a problem at one
>place: mailbox can begin with dot-atom (as part of an addr-spec) or with
>atom (as part of the display-name form of a name-addr).
Again, I think this is just because our grammar is not LALR (or whatever
it is that yacc accepts). Regarded as a CFG, the grammar is not ambiguous
here, and a parser with sufficient lookahead or backtracking would have
coped.
>> 2 dtext = NO-WS-CTL / ; Non white space controls
>> %d33-90 / ; The rest of the US-ASCII
>> %d94-126 ; characters not including
>> ; "[", "]", or "\"
>This permits double quote but excludes backslash. Allowing double quote is
>a problem with no-fold-literal. On the other hand, allowing double quote in
>a dcontent doesn't seem to cause a problem.
Oops! That's because the syntax of no-fold-literal is broken (see below).
>> 2 phrase = 1*word
>The string "ab" can be parsed as either one or two atoms, leading to an
>ambiguity. There is also a problem with attaching CFWS to the preceding or
>following word. As a lash-up, I did:
> phrase = [CFWS] 1*( phrase-item [CFWS] )
> phrase-item = (atext / DQUOTE *( [FWS] qcontent ) [FWS] DQUOTE )
Yes, but we can't change that because that syntax is taken from RFC 2822.
No harm arises (here or in RFC 2822), and the CFWS part of it is just an
example of the usual problem, as discussed above.
>Note that this is the only place where word, and thus atom, appears.
Same as in RFC 2822, I think.
>> 2 specials = "(" / ")" / ; Special characters used in
>> "<" / ">" / ; other parts of the syntax
>> "[" / "]" /
>> ":" / ";" /
>> "@" / "\" /
>> "," / "." /
>> DQUOTE
>This definition is never used except in a comment.
A convenience, as explained in RFC 2822.
>> strict-qtext = NO-WS-CTL / ; qtext restricted to
>> %d33 / ; US-ASCII
>> %d35-91 /
>> %d93-126
>Allowing %d93 (closing square bracket) causes a problem with no-fold-quote.
>The solution is to make it a specific additional alternative to
>strict-qcontent instead.
Oops! That's because the syntax of no-fold-quote is broken (see below).
>> 2* text = %d1-9 / ; all UTF-8 characters except
>> %d11-12 / ; US-ASCII NUL, CR and LF
>> %d14-127 /
>> <EOF> UTF8-xtra-char
>I take it that "<EOF>" is a typo ?
An "artefact" actually :-)
>> 5 tspecials = "(" / ")" / "<" / ">" / "@" /
>> "," / ";" / ":" / "\" / DQUOTE /
>> "/" / "[" / "]" / "?" / "="
>This definition is only used as part of the meta-definition of token-core.
>See the latter for details.
Indeed, as is done also in RFC 2045.
>> Appendix B.2 - Basic Forms
>>
>> {USENET}-header = {USENET}-name ":" SP {USENET}-content
>> *( ";" ( {USENET}-parameter /
>> other-parameter ) )
>The parameter can not be allowed for Organization, Subject, and Summary,
>where the {USENET}-content can contain a semicolon in free text.
Yes, there is verbiage in the text to deal with this. But when I go to the
non-template version (as Paul requests) that problem will go away.
>> 2 addr-spec = local-part "@" domain
>> 2 address = mailbox / group
>> 2 address-list = address *( "," address )
>> 2 angle-addr = [CFWS] "<" addr-spec ">" [CFWS]
>A display-name is a phrase, which is a sequence of words and so can end
>with CFWS. So remove the leading [CFWS] and put it into the first option of
>name-addr instead.
Another example of the usual problem. We are stuck with the way RFC 2822
does it.
>> 5* attribute = {USENET}-token / iana-token / x-token
>I had to remove {USENET}-token because it clashed with the various specific
>instances of such tokens. Since attribute is used only as the name of an
>other-parameter, does {USENET}-token belong here ?
Yes, I had already come to that conclusion. It will likely disappear
along with the templates.
>> header = {USENET}-header / other-header
>[[special case]]
Yup! An inevitable problem if you want to say "all the headers defined by
this standard conform to the syntax of other-header". RFC 2822 has it too
with its optional-field.
>> 5* iana-token = <A token defined in an experimental
>> or standards-track RFC and registered with
>> with IANA>
>For consistency with x-token this should read:
> iana-token = [CFWS] token-core [CFWS]
> ; the token-core must be one defined in an experimental
> or standards-track RFC and registered with IANA
No. It is taken from RFC 2045, so better to leave it that way.
>Note, by the way, that you have "with with".
Noted.
>> 5* token-core = 1*<any (US-ASCII) CHAR except SP, CTLs,
>> or tspecials>
>Can be rewritten as a specific definition:
> token-core = 1*("!" / %d35-39 / "*" / "+" / "-" / "." /
> DIGIT / ALPHA / %d94-96 / %d123-126 )
Again, it is better to stay with RFC 2045. Also with RFC 2616 which also
defines 'tspecials', but differently :-( .
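For what it is worth, the explicit form does appear to describe the same character set as the prose form. The following sketch compares the two, taking CHAR as %d1-127 and CTLs as %d0-31 plus %d127 (as in RFC 2045 and RFC 822); the variable names are illustrative only.

```python
import string

# tspecials as listed in Appendix B.1
TSPECIALS = set('()<>@,;:\\"/[]?=')

# Prose definition: any US-ASCII CHAR except SP, CTLs, or tspecials.
# That leaves the printable range %d33-126 minus tspecials.
prose_set = {chr(c) for c in range(33, 127)} - TSPECIALS

# Explicit definition proposed in the quoted text.
explicit_set = (
    set("!*+-.")
    | {chr(c) for c in range(35, 40)}    # %d35-39
    | set(string.digits)                 # DIGIT
    | set(string.ascii_letters)          # ALPHA
    | {chr(c) for c in range(94, 97)}    # %d94-96
    | {chr(c) for c in range(123, 127)}  # %d123-126
)

definitions_agree = prose_set == explicit_set
```

So the choice between the two forms is one of presentation and of fidelity to RFC 2045, not of content.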
>> Appendix B.3 - Headers
>>
>> Appendix B.3.1 - Template definitions
>>
>> {CONTROL}-verb = <the verb defined in this standard
>> (or an extension of it) for a specific
>> {CONTROL} message>
>> {CONTROL}-arguments = <the arguments defined in this standard
>> (or an extension of it) for a specific
>> {CONTROL} message>
>>
>> Appendix B.3.2 - Template instantiations
>>
>> Followup-To-content = Newsgroups-content / [FWS] "poster" [FWS]
>[[special case]]
Covered by verbiage.
>> Mail-Copies-To-content
>> = copy-addr / [CFWS] ( "nobody" / "poster" ) [CFWS]
>[[special case]]
I think not. A copy-addr MUST have an '@' somewhere inside it.
>> Path-content = [FWS] *( path-identity [FWS] path-delimiter [FWS] )
>> tail-entry [FWS]
>I can't get yacc to accept this as written. If the trailing [FWS] is made
>part of tail-entry instead, this resolves it.
I think this is just another yacc LALR problem. It is better left as it is
to make agreement with the "[FWS] at each end" invariant easier to verify.
>> Subject-content = [ [FWS] back-reference ] pure-subject
>The form with back-reference is a [[special case]] of the form without.
Covered by verbiage.
>> User-Agent-content = product-token *( CFWS product-token )
>Parsing problem - product-token, like other tokens, ends with [CFWS].
The usual suspect again.
>> Appendix B.3.3 - Other header rules
>>
>> arguments = *( CFWS value )
>This rule is never used.
See the rule for {CONTROL}-arguments.
>> article-locator = 1*( %x21-7E ) ; US-ASCII printable characters
>An article-locator can be the last thing in an Xref header, and so may be
>followed by CFWS or a parameter. It is necessary to exclude "(" and ";"
>from this definition.
1*( %x21-27 / %x29-3A / %x3C-7E )
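The revised rule translates directly into a character class. A small sketch, with the hypothetical helper is_article_locator, to confirm that "(" (%x28) and ";" (%x3B) are the only printable characters dropped:

```python
import re

# 1*( %x21-27 / %x29-3A / %x3C-7E ): printable US-ASCII minus "(" and ";"
ARTICLE_LOCATOR = re.compile(r"[\x21-\x27\x29-\x3A\x3C-\x7E]+")


def is_article_locator(text: str) -> bool:
    """True if the whole of `text` matches the revised article-locator."""
    return ARTICLE_LOCATOR.fullmatch(text) is not None


# Printable US-ASCII characters that the revised rule refuses.
excluded = [chr(c) for c in range(0x21, 0x7F) if not is_article_locator(chr(c))]
```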
>> host-value = dot-atom /
>> [ dot-atom ":" ]
>> ( dotted-quad / ; see
>> ipv6-numeric ) ; see
>Um, the referents appear to have gone. And in any case these should have
>syntax specified here. I note that they appear to be special cases of
>dot-atom.
OK, I have made a note to fix that.
>> moderation-flag = %x28.4D.6F.64.65.72.61.74.65.64.29
>> ; case sensitive "(Moderated)"
>[[special case]]
Verbiage needed in section 7.2.1.2.
>> 2 msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
>A msg-id can occur immediately before or after CFWS (in one case - IHAVE -
>it is required to be followed by a single SP). Therefore the leading
>and trailing [CFWS] need to be removed and placed in the relevant rules, if
>any.
No, I would rather fix the problem by allowing CFWS where it says SP in
IHAVE. I have no qualms about doing this, because we did exactly the same
thing in the References-header, where son-of-1036 specified a single SP,
but we put CFWS. Note there is a general "SHOULD NOT use comments where it
says CFWS" except where the previous standards allowed (see 4.2.4), so the
only question is whether multiple whitespaces and/or folding should be
permitted, both in references and ihave. I doubt any extant software has a
problem with that. Similarly, our draft allows CFWS in other control
messages (e.g. newgroup) where son-of-1036 has SP.
But on poking around in the code of CNews I found another can of worms.
The feature of allowing that list of msg-ids in the Ihave control message
is deprecated in favour of putting them in the body, although our draft
says MUST support both methods. However, AFAICS, CNews implements them
only in the body (it ignores any it finds in the header). Now if CNews
(which I suppose must be regarded as "state of the art" regarding UUCP
transport :-) ) does not obey that MUST, then perhaps we should downgrade
it, or even declare that method obsolete.
Perhaps Henry would care to comment? Does anyone know of actual usage of
the "old" method anywhere?
>> newsgroup-description
>> = 1*( [WSP] utext)
>There is a conflict between the leading WSP and the 1*HTAB that always
>precedes the description. Since leading space is presumably not part of the
>description, and since we presumably *do* want to allow multiple spaces,
>change it to:
> newsgroup-description = utext *( *WSP utext )
I have adopted your fix.
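The difference between the two forms is easy to see with regular expressions. This is only a sketch: utext is not defined in this message, so the pattern UTEXT below is an assumption (any single character other than SP, HTAB, CR or LF) standing in for the real rule.

```python
import re

UTEXT = r"[^ \t\r\n]"  # assumption: a stand-in for the draft's utext

# Old rule: 1*( [WSP] utext )
OLD = re.compile(rf"(?:[ \t]?{UTEXT})+")
# New rule: utext *( *WSP utext )
NEW = re.compile(rf"{UTEXT}(?:[ \t]*{UTEXT})*")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


old_leading = accepts(OLD, " Leading space")    # old rule allows this
new_leading = accepts(NEW, " Leading space")    # new rule does not
old_double = accepts(OLD, "Two  spaces here")   # old rule rejects this
new_double = accepts(NEW, "Two  spaces here")   # new rule allows it
```

Under that assumption the new rule refuses leading whitespace (so there is no clash with the preceding 1*HTAB) while permitting runs of several spaces inside the description.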
>and the second part of newsgroups-line to:
> [ HTAB *WSP newsgroup-description ]
But not that one. My belief is that current usage is for all the WSP
between the newsgroup-name and the newsgroup-description to consist of
HTABs. Comments anyone?
>> newsgroups-tag = %x46.6F.72 SP %x79.6F.75.72 SP
>> %x6E.65.77.73.67.72.6F.75.70.73 SP
>> %x66.69.6C.65.3A
>> ; case sensitive
>> ; "For your newsgroups file:"
>[[special case]]
I think not, because that string contains no HTAB.
But, in any case, you can't have a single-component newsgroup-name.
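The hex strings used for the case-sensitive literals (moderation-flag and newsgroups-tag) can be decoded mechanically to check them against their comments. A minimal sketch; decode_abnf_hex is a hypothetical helper that understands only dotted %x runs and the SP core rule.

```python
def decode_abnf_hex(rule: str) -> str:
    """Decode an ABNF value such as '%x46.6F.72 SP %x79.6F' into text.
    Handles only '%x..' dotted hex runs and the SP core rule."""
    out = []
    for element in rule.split():
        if element == "SP":
            out.append(" ")
        elif element.startswith("%x"):
            out.extend(chr(int(h, 16)) for h in element[2:].split("."))
        else:
            raise ValueError(f"unsupported element: {element!r}")
    return "".join(out)


MODERATION_FLAG = decode_abnf_hex("%x28.4D.6F.64.65.72.61.74.65.64.29")
NEWSGROUPS_TAG = decode_abnf_hex(
    "%x46.6F.72 SP %x79.6F.75.72 SP "
    "%x6E.65.77.73.67.72.6F.75.70.73 SP %x66.69.6C.65.3A"
)
```

Both decode to the strings given in the ABNF comments, and the second indeed contains no HTAB.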
>> 2* no-fold-literal = DQUOTE *( dtext / strict-quoted-pair ) DQUOTE
>> 2* no-fold-quote = "[" *( strict-qtext / strict-quoted-pair ) "]"
These should have been:
2* no-fold-literal = "[" *( dtext / strict-quoted-pair ) "]"
2* no-fold-quote = DQUOTE *( strict-qtext / strict-quoted-pair ) DQUOTE
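With the delimiters swapped back, each content rule excludes exactly the delimiter that encloses it, which is why the two problems reported above disappear. A short sketch of that check, using the character ranges quoted earlier in this message and NO-WS-CTL as defined in RFC 2822; the helper delimiters_excluded is hypothetical.

```python
# NO-WS-CTL = %d1-8 / %d11 / %d12 / %d14-31 / %d127  (RFC 2822)
NO_WS_CTL = set(range(1, 9)) | {11, 12} | set(range(14, 32)) | {127}

# dtext = NO-WS-CTL / %d33-90 / %d94-126
DTEXT = NO_WS_CTL | set(range(33, 91)) | set(range(94, 127))

# strict-qtext = NO-WS-CTL / %d33 / %d35-91 / %d93-126
STRICT_QTEXT = NO_WS_CTL | {33} | set(range(35, 92)) | set(range(93, 127))


def delimiters_excluded(content: set, open_ch: str, close_ch: str) -> bool:
    """True if neither delimiter can appear unquoted in the content."""
    return ord(open_ch) not in content and ord(close_ch) not in content


literal_ok = delimiters_excluded(DTEXT, "[", "]")            # corrected form
quote_ok = delimiters_excluded(STRICT_QTEXT, '"', '"')       # corrected form
literal_broken = delimiters_excluded(DTEXT, '"', '"')        # form as posted
quote_broken = delimiters_excluded(STRICT_QTEXT, "[", "]")   # form as posted
```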
>> posting-date-parameter
>> = [CFWS] Posting-Date-token [CFWS] "=" [CFWS]
>> ( date-value /
>> DQUOTE date-value DQUOTE ) [CFWS]
>A date-value can end with CFWS, so move the trailing [CFWS] into the second
>option only.
It's another of those "usual suspects" cases. I think I prefer to leave it
as it is, for the benefit of that invariant (see remarks above under
Path-content).
>> posting-host-parameter
>> = [CFWS] Posting-Host-token [CFWS] "=" [CFWS]
>> ( host-value /
>> DQUOTE host-value DQUOTE ) [CFWS]
>A host-value can be a dot-atom and so begin and end with CFWS. So move both
>[CFWS] into the second option only.
Ditto.
>> posting-sender-parameter
>> = [CFWS] Posting-Sender-token [CFWS] "=" [CFWS]
>> ( sender-value /
>> DQUOTE sender-value DQUOTE ) [CFWS]
>Firstly, a sender-value can begin and end with CFWS, so move both [CFWS]
>into the second option. Secondly, and rather more worryingly, a
>sender-value can begin with a quoted-string, meaning that there's a parsing
>problem with a DQUOTEd sender-value.
Ditto for the first bit.
But the second is REAL NASTY :-( .
Essentially, you want to say:
sender=""Joe D. Bloggs" <jdbloggs@example.com>"
Both sets of DQUOTEs are necessary, the outer ones because the parameter
contains whitespace, and the inner because '.' cannot appear in an atom.
BUT quoted-strings cannot be nested.
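To see why, consider what a straightforward scanner does with such a value. This is a sketch only: scan_quoted_string is a hypothetical helper that treats backslash as quoting the next character and ends at the first unquoted DQUOTE, ignoring folding and the finer points of qtext.

```python
def scan_quoted_string(text: str, start: int = 0):
    """Scan one quoted-string starting at text[start] (which must be a
    DQUOTE). Returns (content, index just past the closing DQUOTE).
    Backslash quotes the next character; nothing else is special."""
    if start >= len(text) or text[start] != '"':
        raise ValueError("expected DQUOTE")
    content = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            content.append(text[i + 1])  # quoted-pair: keep next char
            i += 2
        elif ch == '"':
            return "".join(content), i + 1  # first unquoted DQUOTE ends it
        else:
            content.append(ch)
            i += 1
    raise ValueError("unterminated quoted-string")


value = '""Joe D. Bloggs" <jdbloggs@example.com>"'
first, rest_index = scan_quoted_string(value)
leftover = value[rest_index:]
```

The scanner sees an empty quoted-string followed by unquoted text beginning with "Joe", so the intended outer quoting is lost at the second character.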
I have solved it for now by making the parameter just an addr-spec (no
difficulty there, because this is going to be filled in by injectors who
know who their senders are, in spite of John Stanley who says they cannot
know that). And generally speaking, it is the addr-spec that those
injectors will know (because anyone can lie about his name, so long as his
actual addr-spec is accurate).
But, sooner or later, the MIME people are going to run into this problem,
so maybe I shall ask around on the ietf-822 list.
>> sender-value = ( mailbox / "verified" )
>[[special case]]
I think not (no `@` in "verified" again).
>> verb = token
>This rule is never used.
See {CONTROL}-verb.
-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl@clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5