Re: Collected syntax

From: Clive D.W. Feather (clive@demon.net)
Date: Tue Feb 26 2002 - 05:01:55 CST


Charles Lindsey said:
> The complete Collected Syntax as fixed is reproduced below.

I decided to test whether this syntax is actually unambiguous. So I
converted it all into yacc syntax, mostly mechanically, and then looked to
see what happened when I gave it to yacc.

Firstly, there is definitely a problem with the white space that I haven't
been able to track down yet, and don't really have the time to do so. But I
can say that it's to do with [CFWS] and CFWS being adjacent. If I replace
all uses of [CFWS] with [DUMMY CFWS] there are no problems (modulo the
other changes mentioned below). But if I remove DUMMY then I get a whole
load of conflicts in the grammar.
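One way adjacent white space rules can become ambiguous is sketched below in
Python. It is only an illustration and may not be the exact conflict yacc
reported. It assumes FWS is simply 1*WSP and that no comments are present, so
CFWS reduces to a run of blanks; the helper name count_splits is hypothetical.

```python
import re

# Assumption: with no comments and FWS = 1*WSP, CFWS is just a run of blanks.
WSP_RUN = re.compile(r"[ \t]+")

def count_splits(text):
    """Count the ways text can be read as [CFWS] followed by CFWS."""
    ways = 0
    for cut in range(len(text) + 1):
        head, tail = text[:cut], text[cut:]
        # The optional part may be empty; the mandatory part may not.
        head_ok = head == "" or WSP_RUN.fullmatch(head) is not None
        tail_ok = WSP_RUN.fullmatch(tail) is not None
        if head_ok and tail_ok:
            ways += 1
    return ways
```

Under these assumptions a run of n blanks has n possible readings, so any
input with more than one blank at such a point gives the parser a choice it
cannot resolve.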

Secondly, yacc seems to be unable to cope with the idea that CRLF might be
part of folding white space or might be the end of a real line. For now
I've defined FWS as simply 1*WSP.
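A common workaround for this kind of problem is to decide in the lexer, not
in the grammar, whether a CRLF is a fold or a line end, since that needs a
look at the character after the CRLF. The Python sketch below shows the idea;
the function name classify_crlf and the token names FOLD and EOL are
hypothetical, and this is not what I did for the yacc test described here.

```python
def classify_crlf(data):
    """Return (offset, kind) for every CRLF in data.

    kind is "FOLD" when the CRLF is followed by SP or HTAB, so that it can
    be part of folding white space, and "EOL" otherwise.
    """
    result = []
    pos = data.find("\r\n")
    while pos != -1:
        # One character past the CRLF; empty string at end of input.
        following = data[pos + 2:pos + 3]
        kind = "FOLD" if following in (" ", "\t") else "EOL"
        result.append((pos, kind))
        pos = data.find("\r\n", pos + 2)
    return result
```

A grammar fed FOLD and EOL as distinct tokens no longer has to guess what a
CRLF means.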

Then there's a whole load of specific problems which I discuss inline.

The note [[special case]] means that yacc complained about a conflict, but
it's because we have a special case string (e.g. "verified") conflicting
with a more general case. I inserted a dummy token into the grammar to
resolve these.

> Appendix B - Collected Syntax
>
> Appendix B.1 - Characters, Atoms and Folding
>
> In the following syntactic rules, numbers in the left hand margin
> indicate rules taken from other documents, specifically:
> 2 from with the exception of those elements described therein as
> "obsolete";
> 4 from;
> 5 from.
>
> Where the number is followed by an asterisk ('*'), it indicates that
> the rule in question has been modified for the purposes of this
> standard.
>
> 4 ALPHA = %x41-5A / ; A-Z
> %x61-7A ; a-z
> 2 CFWS = *([FWS] comment) (([FWS] comment) / FWS )

Rewritten as:

     CFWS = (FWS / comment [FWS]) *(comment [FWS])

> 4 CR = %x0D ; carriage return
> 4 CRLF = CR LF
> 4 DIGIT = %x30-39 ; 0-9
> 4 DQUOTE = %d34 ; quote mark
> 2 FWS = ([*WSP CRLF] 1*WSP); Folding whitespace
> 4 HTAB = %x09 ; horizontal tab
> 4 LF = %x0A ; line feed
> 2 NO-WS-CTL = %d1-8 / ; US-ASCII control characters
> %d11 / ; which do not include the
> %d12 / ; carriage return, line feed,
> %d14-31 / ; and whitespace characters
> %d127
> 4 SP = %x20 ; space
> 4 WSP = SP / HTAB ; Whitespace characters
> UTF8-xtra-2-head = %xC2-DF
> UTF8-xtra-3-head = %xE0 %xA0-BF / %xE1-EC %x80-BF /
> %xED %x80-9F / %xEE-EF %x80-BF
> UTF8-xtra-4-head = %xF0 %x90-BF / %xF1-F7 %x80-BF
> UTF8-xtra-5-head = %xF8 %x88-BF / %xF9-FB %x80-BF
> UTF8-xtra-6-head = %xFC %x84-BF / %xFD %x80-BF
> UTF8-xtra-char = UTF8-xtra-2-head 1( UTF8-xtra-tail ) /
> UTF8-xtra-3-head 1( UTF8-xtra-tail ) /
> UTF8-xtra-4-head 2( UTF8-xtra-tail ) /
> UTF8-xtra-5-head 3( UTF8-xtra-tail ) /
> UTF8-xtra-6-head 4( UTF8-xtra-tail )
> UTF8-xtra-tail = %x80-BF
> 2 atext = ALPHA / DIGIT /
> "!" / "#" / ; Any character except
> "$" / "%" / ; controls, SP, and specials.
> "&" / "'" / ; Used for atoms
> "*" / "+" /
> "-" / "/" /
> "=" / "?" /
> "^" / "_" /
> "`" / "{" /
> "|" / "}" /
> "~"
> 2 atom = [CFWS] 1*atext [CFWS]
> 2 ccontent = ctext / quoted-pair / comment
> 2 comment = "(" *([FWS] ccontent) [FWS] ")"
> 2* ctext = NO-WS-CTL / ; all of <text> except
> %d33-39 / ; SP, HTAB, "(", ")"
> %d42-91 / ; and "\"
> %d93-126 /
> UTF8-xtra-char
> 2 dcontent = dtext / quoted-pair
> 2 dot-atom = [CFWS] dot-atom-text [CFWS]
> 2 dot-atom-text = 1*atext *( "." 1*atext )

An atom is a specific case of a dot-atom. This causes a problem at one
place: mailbox can begin with dot-atom (as part of an addr-spec) or with
atom (as part of the display-name form of a name-addr).
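The overlap can be checked with a small Python sketch. It covers only the
text part of each rule (no CFWS) and the US-ASCII atext listed above; the
function name readings is hypothetical.

```python
import re

# atext as listed in the quoted grammar (US-ASCII only).
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
ATOM_TEXT = re.compile(rf"{ATEXT}+")
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

def readings(prefix):
    """List which of the two rules accept prefix in full."""
    names = []
    if ATOM_TEXT.fullmatch(prefix):
        names.append("atom")
    if DOT_ATOM_TEXT.fullmatch(prefix):
        names.append("dot-atom")
    return names
```

Anything accepted as an atom is also accepted as a dot-atom, so a parser that
has read only "john" at the start of a mailbox cannot yet tell which
alternative it is in.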

> 2 dtext = NO-WS-CTL / ; Non white space controls
> %d33-90 / ; The rest of the US-ASCII
> %d94-126 ; characters not including
> ; "[", "]", or "\"

This permits double quote but excludes backslash. Allowing double quote is
a problem with no-fold-literal. On the other hand, allowing double quote in
a dcontent doesn't seem to cause a problem.
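The ranges can be checked mechanically. The Python sketch below builds the
dtext set from the decimal ranges quoted above; the helper names char_set and
in_dtext are hypothetical.

```python
def char_set(*ranges):
    """Build a set of code points from inclusive (low, high) pairs."""
    chars = set()
    for low, high in ranges:
        chars.update(range(low, high + 1))
    return chars

# NO-WS-CTL = %d1-8 / %d11 / %d12 / %d14-31 / %d127
NO_WS_CTL = char_set((1, 8), (11, 12), (14, 31), (127, 127))
# dtext = NO-WS-CTL / %d33-90 / %d94-126
DTEXT = NO_WS_CTL | char_set((33, 90), (94, 126))

def in_dtext(ch):
    """True if the single character ch is allowed by dtext."""
    return ord(ch) in DTEXT
```

The gap between 90 and 94 removes "[", "\" and "]" (91 to 93), while DQUOTE
(34) falls inside the first range.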

> 2 phrase = 1*word

The string "ab" can be parsed as either one or two atoms, leading to an
ambiguity. There is also a problem with attaching CFWS to the preceding or
following word. As a lash-up, I did:

    phrase = [CFWS] 1*( phrase-item [CFWS] )
    phrase-item = (atext / DQUOTE *( [FWS] qcontent ) [FWS] DQUOTE )

Note that this is the only place where word, and thus atom, appears.
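To see how quickly the ambiguity in the original rule grows, here is a Python
sketch that counts the parses of a run of atext characters as 1*word, where
each word is an atom with no CFWS around it. The function name
count_atom_parses is hypothetical, and the input is assumed to consist of
atext characters only.

```python
def count_atom_parses(text):
    """Count the ways text splits into one or more non-empty atoms."""
    if not text:
        return 0  # 1*word needs at least one word
    # ways[i] is the number of ways to split text[:i]; the empty prefix
    # counts once so that a single atom covering everything is included.
    ways = [1] + [0] * len(text)
    for end in range(1, len(text) + 1):
        ways[end] = sum(ways[start] for start in range(end))
    return ways[len(text)]
```

A run of n characters has 2^(n-1) readings, which is why the lash-up treats
each atext as its own phrase-item.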

> 2 qcontent = qtext / quoted-pair
> 2* qtext = NO-WS-CTL / ; all of <text> except
> %d33 / ; SP, HTAB, "\" and DQUOTE
> %d35-91 /
> %d93-126 /
> UTF8-xtra-char
> 2 quoted-pair = "\" text
> 2 quoted-string = [CFWS] DQUOTE
> *( [FWS] qcontent ) [FWS]
> DQUOTE [CFWS]
> 2 specials = "(" / ")" / ; Special characters used in
> "<" / ">" / ; other parts of the syntax
> "[" / "]" /
> ":" / ";" /
> "@" / "\" /
> "," / "." /
> DQUOTE

This definition is never used except in a comment.

> strict-qcontent = strict-qtext / strict-quoted-pair
> strict-quoted-pair = "\" strict-text
> strict-quoted-string
> = [CFWS] DQUOTE
> *( [FWS] strict-qcontent ) [FWS]
> DQUOTE [CFWS]
> strict-qtext = NO-WS-CTL / ; qtext restricted to
> %d33 / ; US-ASCII
> %d35-91 /
> %d93-126

Allowing %d93 (closing square bracket) causes a problem with no-fold-quote.
The solution is to make it a specific additional alternative to
strict-qcontent instead.

> strict-text = %d1-9 / ; text restricted to
> %d11-12 / ; US-ASCII
> %d14-127
> 2* text = %d1-9 / ; all UTF-8 characters except
> %d11-12 / ; US-ASCII NUL, CR and LF
> %d14-127 /
> <EOF> UTF8-xtra-char

I take it that "<EOF>" is a typo?

> 5 tspecials = "(" / ")" / "<" / ">" / "@" /
> "," / ";" / ":" / "\" / DQUOTE /
> "/" / "[" / "]" / "?" / "="

This definition is only used as part of the meta-definition of token-core.
See the latter for details.

> 2* utext = NO-WS-CTL / ; Non white space controls
> %d33-126 / ; The rest of US-ASCII
> UTF8-xtra-char
> 2 word = atom / quoted-string
>
>
>
> Appendix B.2 - Basic Forms
>
> {USENET}-header = {USENET}-name ":" SP {USENET}-content
> *( ";" ( {USENET}-parameter /
> other-parameter ) )

The parameter cannot be allowed for Organization, Subject, and Summary,
where the {USENET}-content can contain a semicolon in free text.

> 2 addr-spec = local-part "@" domain
> 2 address = mailbox / group
> 2 address-list = address *( "," address )
> 2 angle-addr = [CFWS] "<" addr-spec ">" [CFWS]

A display-name is a phrase, which is a sequence of words and so can end
with CFWS. So remove the leading [CFWS] and put it into the first option of
name-addr instead.

> article = 1*( header CRLF ) separator body
> 5* attribute = {USENET}-token / iana-token / x-token

I had to remove {USENET}-token because it clashed with the various specific
instances of such tokens. Since attribute is used only as the name of an
other-parameter, does {USENET}-token belong here?

> body = *( *998text CRLF )
> 2 display-name = phrase
> 2 date = day month year
> 2 date-time = [ day-of-week "," ] date FWS time [CFWS]
> 2 day = [FWS] 1*2DIGIT
> 2 day-name = "Mon" / "Tue" / "Wed" / "Thu" /
> "Fri" / "Sat" / "Sun"
> 2 day-of-week = [FWS] day-name
> 2 domain = dot-atom / domain-literal
> 2 domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
> 2 group = display-name ":" [ mailbox-list / CFWS ] ";"
> [CFWS]
> header = {USENET}-header / other-header

[[special case]]

> header-name = 1*name-character *( "-" 1*name-character )
> 2 hour = 2DIGIT
> 5* iana-token = <A token defined in an experimental
> or standards-track RFC and registered with
> with IANA>

For consistency with x-token this should read:

    iana-token = [CFWS] token-core [CFWS]
        ; the token-core must be one defined in an experimental
        ; or standards-track RFC and registered with IANA

Note, by the way, that you have "with with".

> 2* local-part = dot-atom / strict-quoted-string
> 2 mailbox = name-addr / addr-spec
> 2 mailbox-list = mailbox *( "," mailbox )
> 2 minute = 2DIGIT
> 2 month = FWS month-name FWS
> 2 month-name = "Jan" / "Feb" / "Mar" / "Apr" /
> "May" / "Jun" / "Jul" / "Aug" /
> "Sep" / "Oct" / "Nov" / "Dec"
> 2 name-addr = [display-name] angle-addr
> name-character = ALPHA / DIGIT
> other-header = header-name ":" 1*SP other-content
> other-content
> = <the content of a header defined by some
> other standard>
> other-parameter
> = attribute "=" value
> 2 second = 2DIGIT
> separator = CRLF
> 2 time = time-of-day FWS zone
> 2 time-of-day = hour ":" minute [ ":" second ]
> 5* token = [CFWS] token-core [CFWS]
> 5* token-core = 1*<any (US-ASCII) CHAR except SP, CTLs,
> or tspecials>

Can be rewritten as a specific definition:

    token-core = 1*("!" / %d35-39 / "*" / "+" / "-" / "." /
                    DIGIT / ALPHA / %d94-96 / %d123-126 )

or similar.
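The explicit form can be checked against the prose definition with a short
Python sketch. It assumes CHAR means US-ASCII 0-127 and CTLs means 0-31 plus
127, so the candidates are 33-126; the names TSPECIALS, BY_EXCLUSION,
EXPLICIT and span are hypothetical.

```python
# tspecials as quoted above (15 characters).
TSPECIALS = set('()<>@,;:\\"/[]?=')

# Every printable US-ASCII character (33-126, so no SP and no CTLs)
# that is not a tspecial.
BY_EXCLUSION = {c for c in map(chr, range(33, 127)) if c not in TSPECIALS}

def span(low, high):
    """Characters with code values low..high inclusive."""
    return {chr(n) for n in range(low, high + 1)}

# The explicit rewrite: "!" / %d35-39 / "*" / "+" / "-" / "." /
# DIGIT / ALPHA / %d94-96 / %d123-126
EXPLICIT = (
    set("!*+-.")
    | span(35, 39)
    | span(48, 57)                   # DIGIT
    | span(65, 90) | span(97, 122)   # ALPHA
    | span(94, 96)
    | span(123, 126)
)
```

Under those assumptions the two sets are equal, so the rewrite neither adds
nor loses a character.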

> 5 value = token / quoted-string
> 5* x-token = [CFWS] "x-" token-core [CFWS]
> 2 year = 4*DIGIT
> 2* zone = (( "+" / "-" ) 4DIGIT) / "UT" / "GMT"
>
> Appendix B.3 - Headers
>
> Appendix B.3.1 - Template definitions
>
> {CONTROL}-verb = <the verb defined in this standard
> (or an extension of it) for a specific
> {CONTROL} message>
> {CONTROL}-arguments = <the arguments defined in this standard
> (or an extension of it) for a specific
> {CONTROL} message>
> {USENET}-content
> = <the content of a header defined in this
> standard (or an extension of it) for a
> specific {USENET}-header>
> {USENET}-name
> = <a header-name defined in this standard
> (or an extension of it) for a specific
> {USENET}-header>
> {USENET}-parameter
> = <an other-parameter defined in this standard
> (or an extension of it) for a specific
> {USENET}-header>
> {USENET}-token = <a token defined in this standard for
> use in conjunction with a specific
> {USENET}-parameter>
>
> Appendix B.3.2 - Template instantiations
>
> Approved-content = From-content
> Approved-name = "Approved"
> Archive-content = [CFWS] ("no" / "yes" ) [CFWS]
> Archive-name = "Archive"
> Archive-parameter = Filename-token "=" value
> Cancel-arguments = CFWS msg-id
> Cancel-verb = "cancel"
> Checkgroup-arguments = [ chkscope ] [ chksernr ]
> Checkgroup-verb = "checkgroups"
> Complaints-To-content= address-list
> Complaints-To-name = "Complaints-To"
> Control-content = [CFWS] {CONTROL}-verb {CONTROL}-arguments [CFWS]
> Control-name = "Control"
> Date-content = date-time
> Date-name = "Date"
> Distribution-content = distribution *( dist-delim distribution )
> Distribution-name = "Distribution"
> Expires-content = date-time
> Expires-name = "Expires"
> Filename-token = [CFWS] "filename" [CFWS]
> Followup-To-content = Newsgroups-content / [FWS] "poster" [FWS]

[[special case]]

> Followup-To-name = "Followup-To"
> From-content = mailbox-list
> From-name = "From"
> Ihave-arguments = *( msg-id SP ) relayer-name
> Ihave-verb = "ihave"
> Injector-Info-content= [CFWS] path-identity [CFWS]
> Injector-Info-name = "Injector-Info"
> Injector-Info-parameter
> = posting-host-parameter /
> posting-account-parameter /
> posting-sender-parameter /
> posting-logging-parameter /
> posting-date-parameter
> Keywords-content = phrase *( "," phrase )
> Keywords-name = "Keywords"
> Lines-content = [CFWS] 1*DIGIT [CFWS]
> Lines-name = "Lines"
> Mail-Copies-To-content
> = copy-addr / [CFWS] ( "nobody" / "poster" ) [CFWS]

[[special case]]

> Mail-Copies-To-name = "Mail-Copies-To"
> Message-ID-content = msg-id
> Message-ID-name = "Message-ID"
> Mvgroup-arguments = CFWS newsgroup-name CFWS newsgroup-name
> [ CFWS newgroup-flag ]
> Mvgroup-verb = "mvgroup"
> Newgroup-verb = "newgroup"
> Newgroup-arguments = CFWS newsgroup-name [ CFWS newgroup-flag ]
> Newsgroups-content = [FWS] newsgroup-name
> *( [FWS] ng-delim [FWS] newsgroup-name )
> [FWS]
> Newsgroups-name = "Newsgroups"
> Organization-content
> = 1*( [FWS] utext )
> Organization-name = "Organization"
> Path-content = [FWS] *( path-identity [FWS] path-delimiter [FWS] )
> tail-entry [FWS]

I can't get yacc to accept this as written. If the trailing [FWS] is made
part of tail-entry instead, this resolves it.

> Path-name = "Path"
> Posted-And-Mailed-content
> = [CFWS] ( "yes" / "no" ) [CFWS]
> Posted-And-Mailed-name
> = "Posted-And-Mailed"
> Posting-Account-token= "posting-account"
> Posting-Date-token = "posting-date"
> Posting-Host-token = "posting-host"
> Posting-Logging-token= "logging-data"
> Posting-Sender-token = "sender"
> References-content = msg-id *( CFWS msg-id )
> References-name = "References"
> Reply-To-content = address-list
> Reply-To-name = "Reply-To"
> Rmgroup-arguments = CFWS newsgroup-name
> Rmgroup-verb = "rmgroup"
> Sender-content = mailbox
> Sender-name = "Sender"
> Sendme-arguments = Ihave-arguments
> Sendme-verb = "sendme"
> Subject-content = [ [FWS] back-reference ] pure-subject

The form with back-reference is a [[special case]] of the form without.

> Subject-name = "Subject"
> Summary-content = 1*( [FWS] utext )
> Summary-name = "Summary"
> Supersedes-content = msg-id
> Supersedes-name = "Supersedes"
> User-Agent-content = product-token *( CFWS product-token )

Parsing problem: product-token, like other tokens, ends with [CFWS], which
then collides with the CFWS separator that follows it.

> User-Agent-name = "User-Agent"
> Xref-content = [CFWS] server-name 1*( CFWS location ) [CFWS]
> Xref-name = "Xref"
>
> Appendix B.3.3 - Other header rules
>
> arguments = *( CFWS value )

This rule is never used.

> article-locator = 1*( %x21-7E ) ; US-ASCII printable characters

An article-locator can be the last thing in an Xref header, and so may be
followed by CFWS or a parameter. It is necessary to exclude "(" and ";"
from this definition.
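The effect of the exclusion can be seen with a Python sketch that matches
greedily, as a simple scanner would. The pattern names ORIGINAL and
RESTRICTED and the helper locator are hypothetical, and the sample inputs are
made up for illustration.

```python
import re

# article-locator as written: 1*( %x21-7E )
ORIGINAL = re.compile(r"[\x21-\x7E]+")
# The same range with "(" (%x28) and ";" (%x3B) removed.
RESTRICTED = re.compile(r"[\x21-\x27\x29-\x3A\x3C-\x7E]+")

def locator(pattern, text):
    """Return the leading article-locator that pattern finds in text."""
    match = pattern.match(text)
    return match.group(0) if match else None
```

With the original rule the locator swallows a following parameter or comment
whole; with the restricted rule it stops at the ";" or "(".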

> article-size = 1*DIGIT
> back-reference = %x52.65.3A.20
> ; which is a case-sensitive "Re: "
> batch = 1*( batch-header article )
> batch-header = "#!" SP rnews SP article-size CRLF
> checkgroups-body = *( valid-group CRLF )
> chkscope = 1*( CFWS ["!"] newsgroup-name )
> chksernr = CFWS "#" 1*DIGIT
> combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
> combiner-base = combiner-ASCII / combiner-extended
> combiner-extended = <any character with a Unicode code value of
> 0080 or greater and a combining class of 0,
> but excluding any character in Unicode
> categories Cc, Cf, Cs, Zs, Zl, and Zp>
> combiner-mark = <any character with a Unicode code value of
> 0080 or greater and a combining class other
> than 0>
> component = 1*component-glyph
> component-glyph = combiner-base *combiner-mark
> copy-addr = address-list
> date-value = 1*DIGIT [ ":" date-time ]
> dist-delim = ","
> distribution = positive-distribution /
> negative-distribution
> distribution-name = ALPHA 1*distribution-rest
> distribution-rest = ALPHA / "+" / "-" / "_"
> groupinfo-body = [ newsgroups-tag CRLF ]
> newsgroups-line CRLF
> host-value = dot-atom /
> [ dot-atom ":" ]
> ( dotted-quad / ; see
> ipv6-numeric ) ; see

Um, the referents appear to have gone. And in any case these should have
syntax specified here. I note that they appear to be special cases of
dot-atom.

> 2 id-left = dot-atom-text / no-fold-quote
> 2 id-right = dot-atom-text / no-fold-literal
> ihave-body = *( msg-id CRLF )
> location = newsgroup-name ":" article-locator
> moderation-flag = %x28.4D.6F.64.65.72.61.74.65.64.29
> ; case sensitive "(Moderated)"

[[special case]]

> 2 msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]

A msg-id can occur immediately before or after CFWS (in one case - IHAVE -
it is required to be followed by a single SP). Therefore the leading
and trailing [CFWS] need to be removed and placed in the relevant rules, if
any.

> negative-distribution
> = [FWS] "!" distribution-name [FWS]
> newgroup-flag = "moderated"
> newsgroup-description
> = 1*( [WSP] utext)

There is a conflict between the leading WSP and the 1*HTAB that always
precedes the description. Since leading space is presumably not part of the
description, and since we presumably *do* want to allow multiple spaces,
change it to:

    newsgroup-description = utext *( *WSP utext )

and the second part of newsgroups-line to:

                              [ HTAB *WSP newsgroup-description ]

> newsgroup-name = component *( "." component )
> newsgroups-line = newsgroup-name
> [ 1*HTAB newsgroup-description ]
> [ 1*WSP moderation-flag ]
> newsgroups-tag = %x46.6F.72 SP %x79.6F.75.72 SP
> %x6E.65.77.73.67.72.6F.75.70.73 SP
> %x66.69.6C.65.3A
> ; case sensitive
> ; "For your newsgroups file:"

[[special case]]

> ng-delim = ","
> 2* no-fold-literal = DQUOTE *( dtext / strict-quoted-pair ) DQUOTE
> 2* no-fold-quote = "[" *( strict-qtext / strict-quoted-pair ) "]"
> path-delimiter = "/" / "?" / "%" / "," / "!"
> path-identity = 1*( ALPHA / DIGIT / "-" / "." / ":" / "_" )
> positive-distribution
> = [FWS] distribution-name [FWS]
> posting-account-parameter
> = [CFWS] Posting-Account-token [CFWS] "=" value
> posting-date-parameter
> = [CFWS] Posting-Date-token [CFWS] "=" [CFWS]
> ( date-value /
> DQUOTE date-value DQUOTE ) [CFWS]

A date-value can end with CFWS, so move the trailing [CFWS] into the second
option only.

> posting-host-parameter
> = [CFWS] Posting-Host-token [CFWS] "=" [CFWS]
> ( host-value /
> DQUOTE host-value DQUOTE ) [CFWS]

A host-value can be a dot-atom and so begin and end with CFWS. So move both
[CFWS] into the second option only.

> posting-logging-parameter
> = [CFWS] Posting-Logging-token [CFWS] "=" value
> posting-sender-parameter
> = [CFWS] Posting-Sender-token [CFWS] "=" [CFWS]
> ( sender-value /
> DQUOTE sender-value DQUOTE ) [CFWS]

Firstly, a sender-value can begin and end with CFWS, so move both [CFWS]
into the second option. Secondly, and rather more worryingly, a
sender-value can begin with a quoted-string, meaning that there's a parsing
problem with a DQUOTEd sender-value.

> product-token = value [ "/" product-version ]
> product-version = value
> pure-subject = 1*( [FWS] utext )
> relayer-name = path-identity
> rnews = %x72.6E.65.77.73 ; case sensitive "rnews"
> sender-value = ( mailbox / "verified" )

[[special case]]

> sendme-body = ihave-body
> server-name = path-identity
> tail-entry = 1*( ALPHA / DIGIT / "-" / "." / ":" / "_" )
> valid-group = newsgroups-line
> verb = token

This rule is never used.

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive@davros.org>  | Fax:  +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            | NOTE: fax number change

