
RE: 10. "Report propagation sequence"



>[Albert]
>Likewise there is no difficulty for *any* entry based
>implementation in simply maintaining an index of entries by
>local record sequence number.
>
>Therefore I argue that adding a local record sequence number
>to enable propagating updates in strict order and using
>HighwaterMarks between neighbours should be the architectural
>change, rather than only requiring strict ordering within
>each originator replicaId.
>
>Do you now agree?

[Zach]
Yes, adding a local sequence number merely requires an
in-order merge for a log based implementation, and this will be
efficient enough when using HighwaterMarks.

I would propose instead though for an ordering requirement:
propagating updates in strictly increasing CSN order for each
replica ID.

Note that this does not impose the constraint that we must
cycle through each replica, and send all changes from that
replica before sending changes from another.  It does not
forbid it either.  So an entry based implementation could
simply transmit changes in the order received, and a log based
could transmit in-order by replicaID.  Both of these transmit
orders must produce the same result state for the database,
because they can both happen naturally (unless our URP rules
are horribly broken).
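
The ordering requirement above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: CSNs
are reduced to plain integers, each update is a hypothetical
(replica ID, CSN) pair, and the function name is invented here; none
of this is the draft's actual encoding.

```python
def ordered_per_replica(updates):
    """Return True if CSNs strictly increase within each replica ID.

    `updates` is a sequence of (replica_id, csn) pairs in transmit order.
    Interleaving between different replica IDs is not constrained.
    """
    last_csn = {}  # replica_id -> highest CSN seen so far for that replica
    for replica_id, csn in updates:
        if replica_id in last_csn and csn <= last_csn[replica_id]:
            return False
        last_csn[replica_id] = csn
    return True


# Entry based style: transmit in the order the changes were received.
as_received = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]
# Log based style: transmit in order, one replica ID at a time.
by_replica = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
# Violation: replica A's changes are sent out of CSN order.
out_of_order = [("A", 2), ("A", 1)]
```

Both `as_received` and `by_replica` satisfy the check, which is the
point of the paragraph above: the requirement constrains order within
a replica ID only, not the interleaving across replica IDs.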

[Albert]
Ok, as I see it, we are both agreed that adding a local
sequence number and HighwaterMarks and requiring propagation
in the exact same order as received would not be a problem for
either entry based or log based implementations. Likewise
we are both agreed that adding HighwaterMarks would improve
performance.

However you still maintain that it is sufficient
to require that the originator sequence numbers for each
replicaId be maintained in order, while I maintain that the
record numbers of updates should be assigned by the consumer
as the local sequence numbers as the updates are received,
and then used as the message numbers of updates that are
supplied to neighbour consumers, so that the *exact* sequence
is preserved. 

I also maintain that the originator and local sequence number
(identical) for any local changes should be assigned from the
same sequence number generator, so that local changes are
also interleaved into the *exact* propagation sequence.
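
A minimal Python sketch of the numbering scheme described in the two
paragraphs above, for illustration only. The class, method, and field
names are hypothetical, and the log record layout is an assumption
rather than anything specified in MDCR or the LDUP drafts.

```python
from itertools import count


class Replica:
    """Assigns one local sequence number to every update, in arrival order."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self._next_seq = count(1)  # single generator for received AND local changes
        self.log = []              # records: (local_seq, originator_id, originator_seq, change)

    def receive(self, originator_id, originator_seq, change):
        # An update from a neighbour keeps its originator numbering and also
        # gets the next local sequence number, recording the order received.
        local_seq = next(self._next_seq)
        self.log.append((local_seq, originator_id, originator_seq, change))
        return local_seq

    def local_change(self, change):
        # A local change draws from the same generator, so its originator
        # and local sequence numbers are identical.
        seq = next(self._next_seq)
        self.log.append((seq, self.replica_id, seq, change))
        return seq


b = Replica("B")
b.receive("A", 7, "add cn=x")
b.local_change("modify cn=y")
b.receive("A", 8, "modify cn=x")

# Message numbers supplied to neighbour consumers are the local sequence numbers.
outgoing = [(rec[0], rec[1], rec[2]) for rec in b.log]
```

Here the local change on B lands between the two updates relayed from
A, so a neighbour pulling from B sees the same interleaving that B
itself observed.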

I agree that this does not contribute to saving bandwidth,
and that the HighwaterMarks and local sequence numbers
alone are sufficient for both improving performance and
avoiding restarts on connection failure, without requiring
*exact* correlation of the incoming and outgoing sequences.
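
As a rough illustration of how a HighwaterMark avoids a restart after
a connection failure, consider the Python sketch below. All names are
hypothetical, the supplier log is assumed to be a list of
(local sequence number, change) pairs in increasing order, and
`fail_after` exists only to simulate a dropped connection.

```python
class Consumer:
    def __init__(self):
        self.highwater = {}  # supplier id -> last sequence number committed from it
        self.applied = []    # changes committed so far, in commit order

    def pull(self, supplier_id, supplier_log, fail_after=None):
        """Apply every update above the stored HighwaterMark, in order.

        `fail_after` simulates the link dropping after that many updates.
        """
        mark = self.highwater.get(supplier_id, 0)
        transferred = 0
        for seq, change in supplier_log:
            if seq <= mark:
                continue  # already committed in an earlier session
            if fail_after is not None and transferred == fail_after:
                raise ConnectionError("link dropped mid-transfer")
            self.applied.append(change)
            self.highwater[supplier_id] = seq  # advance the mark with each commit
            transferred += 1


supplier_log = [(1, "u1"), (2, "u2"), (3, "u3"), (4, "u4")]
consumer = Consumer()

try:
    consumer.pull("A", supplier_log, fail_after=2)  # connection dies after two updates
except ConnectionError:
    pass
mark_after_failure = consumer.highwater["A"]

consumer.pull("A", supplier_log)  # resume: only updates above the mark are applied
```

Because the supplier sends in strictly increasing sequence order, the
single stored mark is enough to tell the consumer where to resume; a
real implementation would need to commit each change and its mark
atomically, which this sketch does not attempt.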

I will now explain the other reasons for maintaining that
*exact* correlation below, as it is closely related to the
severe problems you discuss below as to what happens when
a server crashes.

Essentially, I am trying to maintain an *exact* hop by hop
sequence which has certain global properties, even though
there is no corresponding exact global sequence (because
of parallel and transitive routes, multiple sources etc).

The obvious global properties include maintaining the
sequence within each originator replicaId and also ensuring
that parents are created before children (even though
children may be created at a different replica). Both of
these can be achieved without the *exact* sequence I am
proposing, although the second may be simplified by it.

The less obvious global property that I am aiming for is
that a DSA crash can never result in other DSAs receiving
later updates that have been applied to an entry, without
also receiving any earlier changes that those later changes
were dependent on. I am using "earlier" and "later" in the
context of the version number tree maintained by MDCR,
and ensuring a global property of the report propagation
sequence that substitutes for the *illusion* of a uniform
flow of time reflected in CSNs as used in the current
architecture.

Up to now we have been discussing restarts and performance
issues in a context equally applicable to the current
architecture for report propagation.

I must now ask you (and anyone else joining this thread)
to carefully study the alternative Coda/Active Directory
based report propagation procedures proposed in sections
6 and 7 of my MDCR draft, as they are essential for
understanding the issues below.

http://www.ietf.org/internet-drafts/draft-langer-ldup-mdcr-00.txt

Section 4 should also be read for background. Hmm, I just
noticed that there is no section 5, so that is pages 4-12.

Note that I am proposing report propagation should be
entirely independent of update processing, with all
reports delivered to all replicas unaltered, and
new reports generated only by originators for local
changes but never as a result of applying updates en route.

This is equally applicable whether changes are ultimately
merged at each replica as with URP, or conflicts are resolved,
as with MDCR (section 8).

The current architecture does not support any requirement for
eventual convergence as it was written oblivious to the facts
that DSAs sometimes crash and get restored from backups and
that clocks are not necessarily even monotonic.

Steve has indicated that he will be drafting report propagation
procedures taking those facts into account for URP, but at present
the Coda/Active Directory report propagation procedures
proposed in MDCR are the only ones available, and they are
equally applicable to URP. Assuming that the WG will require
eventual convergence under *all* circumstances, I think it
is therefore reasonable to ask people to carefully study the
only proposal currently available for achieving that, even
though it is not yet on the WG agenda.

[Zach]
What initially brought me to this discussion was a quite
different idea, though.

What really worried me about the current phrasing of the draft
was the fact that no ordering requirements were made at all
within a transmission.  This causes severe problems when a
replication update is interrupted.

I'm not so much worried about bandwidth, but the following
scenario:

Replica A sends a very long partial update to replica B. 
Somewhere in the middle of the transfer, the connection is
terminated.  Since we can get CSNs out of order within a
replica ID, we have three options:
   1)  Save all changes until the transfer is complete, then
       commit them and send an end replication response.  This
       still worries me because committing the changes may
       take a very long time, during which our peer may decide
       we are dead, and drop the connection.

[Albert]
Likewise my primary concern is not about bandwidth, although
this particular thread started from that issue of restarts
(and spikes from full replication). The proposal for *exact*
sequencing is primarily directed at a purge mechanism that
can guarantee eventual convergence. Please review the previous
discussion of that primary concern in 4. "Eventual convergence
- Version numbers or timestamps":

http://www.imc.org/ietf-ldup/mail-archive/msg00616.html

and in issue C, "Convergence" of:

http://www.imc.org/ietf-ldup/mail-archive/msg00641.html

See also 5. "Oscillation", especially:

http://www.imc.org/ietf-ldup/mail-archive/msg00650.html

I agree with you that option 1 is undesirable and in fact regard
it as unacceptable as a general solution. Some other option MUST
be available to implementors.

[Zach]
   2)  Commit changes as we get them.  Very problematic, since
       there is nothing prohibiting LDAP updates to our local
       replica while we are in a replication session.  So when
       the connection dies, we have no idea what the current
       update vector for our replica is, and we can't easily
       back out changes because we may have received updates.

[Albert]
Since both option 1 and option 3 are unacceptable as a general
solution, option 2 is the only option that remains for general use,
however problematic.

With the current architecture, option 2 is not merely "very
problematic", but "completely broken".

The CSNs generated by any supplier replica and/or expected by any
consumer replica may in fact move backwards due to errors in setting
time zones and daylight savings. (BTW year 2K testing would have
resulted in any test changes becoming stuck until this millennium,
if the WG had actually achieved its original timetable
for deployment of standards last millennium, as there is no way to
get rid of a change with a "future" CSN).

A DSA that crashes will normally have local changes that have not
been backed up, some of which will have been replicated to some
neighbours, and others of which will not have been replicated to
any. When it is restored from local backups to an earlier state,
it could attempt to resynchronize with its neighbours by a partial
replication, but this may fail for any of its own changes that
are outside the time window allowed for purging, as a result of
a substantial delay before restoration. Thus a full replication
to resynchronize becomes necessary, with a consequent bandwidth
spike, despite having an available local backup that could have
been used to allow a partial replication if the report propagation
architecture was not broken.

Even if a partial replication to resynchronize is possible, any
local changes that were not replicated before the crash, but were
backed up, may not replicate to other DSAs, preventing convergence.

These complex issues were dealt with in the Coda research adopted
by Active Directory which I am proposing should be adopted as the
basis for report propagation by this WG.

Their solution does require *exact* ordering and I seriously doubt
that anybody will come up with a better one that does not.

Until somebody does, since we are both agreed that exact ordering
would not be a problem for either log based or entry based
implementations, can we also agree that it should be tentatively
adopted as the design, pending some other alternative that can also
be proved to guarantee eventual convergence under *all* circumstances?

[Zach]
   3)  Lock down our local replica from updates during the
       replication process.  For some LDAP applications, this
       will not be an option.  We may not know which replica
       to send update referrals to, or our clients may not
       be able to chase referrals for any number of reasons.
       Not only this, but if the connection does die, we can't
       allow updates until replication is re-established with
       the same supplier, or we revert all the changes and
       restore to a known state.

Since there isn't actually enough information in a change
record to undo the change, any server crash during this
process is rather disastrous.  This means we need to store a
reverse operation for each change as well.

[Albert]
I agree that this is unacceptable. Refusing DUA updates is
contemplated in the current architecture (with a "MAY") only
for a full replication, which should only be necessary when
a DSA is already offline anyway for that replication area
(eg due to a crash). I believe that is correct and going offline
to DUAs during incremental replication would completely negate
the point of multi-master.

BTW, by replicating operations rather than primitives, MDCR does
carry enough information to be able to reverse changes.

[Zach]
These are the kind of warts I would like to avoid in the LDUP
protocol.

[Albert]
These warts arise directly from using CSNs based on timestamps
instead of version numbers, and from not maintaining an
*exact* sequence when replicating. Surgery is required.