|
There are several residual Internet-Drafts under review by
the Ediint vendor and user community that respond to industry requests
for standardization. Most of these Internet-Drafts are intended as informational RFCs and
are not “official” Ediint chartered efforts. Examples include drafts on
Compression, the Features header, Certificate Exchange messages (CEM), filename
transmission, multipart payload support, and Reliability for AS2. The question
here concerns a small but important point in the AS2 reliability draft.

AS2 has an option for either synchronous or asynchronous
MDNs. The issue for comment concerns synchronous MDN mode.

Most vendors have attempted to provide for some recovery
from network and/or server failures, and also to protect their customers from
resource exhaustion. When synchronous MDNs are used to transfer large amounts
of business data with compression, digital signatures, and encryption applied
to that data, heavily loaded systems can take a large amount of time to produce
the MDN to send back in the HTTP response. The HTTP connection then needs to be
held open for an unpredictable amount of time, using resources on both sides.

Now, because it is possible for an AS2 application to become
“hung” on the server side, software engineers often build in a “timer”
that closes a connection after some period of time. Unfortunately, the timeout can
occur before the HTTP requester (client) has received the protocol’s HTTP
response. In addition, HTTP intermediaries (tunnels, proxies, gateways, etc.)
may time out a connection along the path from client to final HTTP server based
on “inactivity,” and again prevent the completion of the HTTP
exchange. These exceptional conditions may be tied to an exception
handler that retries the HTTP request with its large payload. More often than
not, retrying a large payload against an increasingly loaded server is a recipe
for further failure (and retry). Because AS2 payloads are growing from the
tens to hundreds of megabytes, and the AS2 traffic on existing servers is
growing, the “timeout/retry” spiral has become an operational
difficulty for AS2 systems that needs consideration.

The AS2 specification does have a built-in solution for this
problem: asynchronous MDN mode. However, users have asked
whether anything else might be done to address the timeout
problem and make AS2 in synchronous MDN mode more reliable.

One direction is to try to make the timeout interval value “flexible”
and adapt it intelligently. While both transmission time and payload size are
known to the sender, the receiver load (often the most critical factor) is not
known. So it becomes difficult to arrive at an intelligent solution that will
not sometimes be wrong, which tends not to satisfy AS2 end users.

Another direction might be to prohibit timeouts. This
solution would remove protections against tying up resources (both on sender
and receiver sides) in the truly exceptional situations of a hung or dead thread/process
that did not clean up with an appropriate HTTP status code (5xx range). Again,
developers and engineers would resist adopting this solution.

Another direction might be to prohibit retries when using
synchronous MDNs. This direction effectively gives up on AS2 reliability. When
the specific error condition is recoverable (server down, connection refused,
transient network error, server temporarily busy), then retry can be a
reasonable way to enhance automation and reduce the need for operational
intervention and special manual handling.

If the basic problem of the “timeout/retry” spiral
is that there is no way to tell intermediaries or the client that there is
forward progress being made on completing the HTTP response, then one remaining
direction is to provide a forward progress indicator. HTTP does
offer a way to provide this feature, using the HTTP
“100 Continue” response status. In other words, an HTTP server can
be configured to send a sequence of “100 Continue” replies, and an
HTTP/1.1 client is effectively instructed to keep waiting for a reply in the success
range (“2xx”) or possibly failure (“5xx”). [The 3xx and
4xx cases are ignored here for simplicity; these statuses should be given as
an initial HTTP response IMO.] This solution does not magically create
resources when they are falling short, but at least it does potentially avoid
the “timeout/retry” spiral.

Recommending that AS2 reliability make use of this “keep
alive” or “forward progress” indicator would mark a change in
current operational modes. It is to be expected that this capability would be
marked by a special feature value (or AS2-version number if the feature header
is not approved) to allow a smooth transition to interoperability.

Also, how frequently to send “100 Continue” replies and how to react to a stretch
of time without them are issues needing consensus from the participants on
this list. This assumes that people support the direction proposed here, so
stakeholders should let their views be known! |
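
As a rough illustration of the forward progress idea above, the Python sketch below shows how a client might read a synchronous-MDN HTTP response while treating each interim “100 Continue” reply as a sign of life. The function name `read_final_status`, the `on_progress` callback, and the sample byte stream are hypothetical, chosen only for this example; nothing here is taken from the AS2 reliability draft or from any vendor's implementation.

```python
import io


def read_final_status(stream, on_progress=None):
    """Return the first non-1xx status code found on a binary stream.

    Hypothetical helper: each interim (1xx) response is reported through
    on_progress so the caller can reset its own inactivity timer.
    """
    while True:
        status_line = stream.readline()
        if not status_line:
            raise ConnectionError("connection closed before a final response")
        parts = status_line.decode("iso-8859-1").split(None, 2)
        if len(parts) < 2 or not parts[0].startswith("HTTP/"):
            raise ValueError("malformed status line: %r" % status_line)
        status = int(parts[1])

        # Skip this response's header block, which ends at a blank line.
        while stream.readline() not in (b"\r\n", b"\n", b""):
            pass

        if 100 <= status < 200:
            # Interim response: the server is still working on the MDN.
            if on_progress is not None:
                on_progress(status)
            continue
        return status


# Assumed wire data: two progress replies, then the final reply that
# would carry the MDN (body omitted for brevity).
wire = io.BytesIO(
    b"HTTP/1.1 100 Continue\r\n\r\n"
    b"HTTP/1.1 100 Continue\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
)
progress = []
final = read_final_status(wire, on_progress=progress.append)
```

A caller could use `on_progress` to push back an idle deadline instead of retrying the large payload. How long to wait between interim replies before giving up is the open question raised above, so the sketch deliberately leaves that policy to the caller.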