RFC1154 line counts again
Would someone please explain to me what a "line" is, and why this Internet
transmission concept is meaningful at the RFC-822 level?
This is a somewhat facetious question, but it conveys a serious intent and a
serious problem in any environment where a "line" is not what the Internet
RFCs consider to be a "line".
The RFC definition of a "line", "a sequence of zero or more bytes terminated
by an ASCII carriage return followed immediately by an ASCII line feed", comes
from the days when the majority ARPANET system was some flavor of PDP-10
operating system, all of which used this convention. This definition
corresponds to the actual ASCII character sequence needed to effect a new line
on most unit record character printers and displays. The needs of those
record-oriented operating systems (e.g. IBM OS/360) were largely ignored.
Although most terminals still use that definition of a line, most operating
systems in use today do not. In particular, the majority operating system on
the network today appears to be some flavor of Unix.
On Unix, a "line" is "a sequence of zero or more bytes terminated by an ASCII
line feed." No carriage return is involved.
There has always been a compatibility problem on Unix having to do with
incoming text from a CR/LF convention operating system that has embedded and
significant bare LF's. The bare LF's become indistinguishable from newlines
on Unix. Fortunately, most text files which contain such bare LF's merely do
so as a way of conserving space in the event of multiple consecutive newlines
instead of as an explicit vertical formatter with no horizontal motion. But
this cannot be guaranteed. No assumptions can be made.
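To make the loss of information concrete, here is a minimal sketch in Python. The function name to_unix and the two sample messages are made up for illustration, and the conversion shown is only an assumption about what a receiving Unix mailer typically does. The two messages differ on the wire, but they are byte-for-byte identical after conversion:

```python
def to_unix(wire: bytes) -> bytes:
    """Naive CRLF -> LF conversion (hypothetical; assumed to resemble
    what a receiving Unix system applies to incoming text)."""
    return wire.replace(b"\r\n", b"\n")

# In the first message the LF after "This" is a bare vertical move;
# in the second it is part of a real CRLF line ending.
bare_lf = b"This\nis\r\n"
real_newline = b"This\r\nis\r\n"

print(bare_lf == real_newline)                    # the wire forms differ
print(to_unix(bare_lf) == to_unix(real_newline))  # the local forms do not
```

Once the text is in local form there is no way to tell which of the two was sent, so the transform cannot be reversed.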
So, the following document is sent from CR/LF system A to Unix system B:
Hi there<CRLF><LF>This<LF>is<LF>a<LF>test!<CRLF><LF>Bye!<CRLF>
This should appear as:
Hi there

This
|   is
|     a
|      test!

Bye!
although on Unix the lines indicated with "|" are likely to be flushed to the
left instead of indented as shown.
Now, the question is:
How many lines are there in the above example?
I can make a case for there being 3, 5, or 8 lines.
System A would probably count 3 lines (strict Internet definition) or 5 lines
(knowing that two of those <LF>'s are space-conserved newlines). System B
would call it 8 lines.
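For anyone who wants to check the arithmetic, the three counts can be reproduced with a short Python sketch. The function names are hypothetical, and the "space-conserved" rule is only my reading of that convention (a bare LF immediately following a CRLF counts as one more newline); it is not taken from any RFC:

```python
import re

# The example message exactly as system A puts it on the wire.
WIRE = b"Hi there\r\n\nThis\nis\na\ntest!\r\n\nBye!\r\n"

def count_strict(wire: bytes) -> int:
    """Strict Internet definition: only CRLF terminates a line."""
    return wire.count(b"\r\n")

def count_space_conserved(wire: bytes) -> int:
    """Each CRLF ends a line, and every bare LF directly after a CRLF
    is counted as an additional (empty) line. Assumed convention."""
    total = 0
    for match in re.finditer(rb"\r\n(\n*)", wire):
        total += 1 + len(match.group(1))
    return total

def count_unix(wire: bytes) -> int:
    """Unix view after CRLF -> LF conversion: every LF ends a line."""
    return wire.replace(b"\r\n", b"\n").count(b"\n")

print(count_strict(WIRE), count_space_conserved(WIRE), count_unix(WIRE))
# prints: 3 5 8
```

All three functions are defensible readings of the same bytes, which is exactly the problem.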
The moral of this sad story is that line counts are only trustworthy between
operating systems with identical notions of what constitutes a "line" or, most
importantly, when communicating with a system which has a completely
reversible transform between its local concept of a line and the
communications concept of a line.
Unix is not such a system. Nor can this problem be thrown in the face of
those implementing e-mail software on Unix. I am a UA implementor. The SMTP
server and mailer are completely separate from my software. I have no control
over what the SMTP server or mailer does to the bits before I get ahold of
them. I would dearly love to completely rewrite the BSD Unix mailer and SMTP
server (my vendetta against sendmail goes back to when I was a TOPS-20 email
hacker!) but I don't envision my boss allowing me to do so in the foreseeable
future.
There is still a TOPS-20 or two which can be enlisted to send RFC-legal email
that will break any existing RFC-1154 parser. I haven't even begun on all the
ramifications of systems which use CR as the newline marker instead of CRLF or
LF.
CONCLUSION:
There are legitimate technical problems with line counts. What's more, enough
people, for whatever reason, have objected to line counts. These people
include many of the important implementors of email around the network. Their
good will is necessary for the widespread adoption of an RFC-1154 type of
facility.
I have written an RFC-1154 implementation. I am not satisfied with my own
handling of line counts, and would prefer to deal with the apparently
insurmountable technical problems by using a different mechanism. Thank you
for your consideration.