
Re: Single-Master Failover Recovery and Automated Replication



This is a problem if the master doesn't use a full (two-phase) commit for each
database operation, i.e., replicate the change to all slaves before the final
commit.  If this is not the case, then the master can have a backlog of
changes which have not yet been replicated to the slaves.
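
To make the distinction concrete, here is a toy sketch in Python.  It is
not how any real LDAP server works; Master, Slave, the sync flag and
flush() are all names I made up for illustration.  With sync=True each
write reaches every slave before it is committed on the master; with
sync=False the write commits locally and sits in a backlog until flush()
runs.

```python
class Slave:
    def __init__(self):
        self.log = []          # changes applied on this slave

    def apply(self, change):
        self.log.append(change)


class Master:
    def __init__(self, slaves, sync):
        self.slaves = slaves
        self.sync = sync       # True: replicate before commit
        self.log = []          # changes committed on the master
        self.backlog = []      # committed locally, not yet replicated

    def write(self, change):
        if self.sync:
            # Replicate to every slave before the local commit.
            for slave in self.slaves:
                slave.apply(change)
        else:
            # Commit now, replicate later: this is the backlog.
            self.backlog.append(change)
        self.log.append(change)

    def flush(self):
        # Push all pending changes to every slave, in order.
        for change in self.backlog:
            for slave in self.slaves:
                slave.apply(change)
        self.backlog = []
```

If the asynchronous master dies before flush() is called, whatever is in
backlog exists only on that machine.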

If one of these slaves is upgraded to a master server, then converting the
original master to a slave is problematic, since the old master has a list
of changes which have not been committed to the new master.
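
As a rough illustration of what "problematic" means here, the sketch
below splits two change histories into the part both servers share and
the part each has alone.  It assumes (my assumption, not anything LDUP
specifies) that each server keeps an ordered list of change identifiers
and that the two lists agree up to the point of failure.
split_histories is a hypothetical helper name.

```python
def split_histories(old_master_log, new_master_log):
    """Return (common, old_only, new_only) for two ordered change logs."""
    n = 0
    limit = min(len(old_master_log), len(new_master_log))
    # Walk forward while both servers recorded the same change.
    while n < limit and old_master_log[n] == new_master_log[n]:
        n += 1
    return old_master_log[:n], old_master_log[n:], new_master_log[n:]


old = ["c1", "c2", "c3", "c4"]   # c3, c4 were never replicated
new = ["c1", "c2", "n1"]         # n1 was accepted after promotion
common, old_only, new_only = split_histories(old, new)
```

Whenever both old_only and new_only are non-empty, the two servers have
diverged and something has to give.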

Basically, the section you quote is saying that in this case, administrative
action is necessary to correct the problem.  It is possible to automate this,
though.  You have several choices of action at this point:

1) Roll back all changes from the old master which had not been replicated
to the new master.  Either transactional rollback or wipe-overwrite-sync of
the database is required.

2) Merge changes from the old master into the new master, using conflict
resolution, either automated or manual.

3) Merge changes from the new master into the old master, using conflict
resolution, and return the old master to service, downgrading the new master.

4) Discard changes from the new master and return the old master to service.

5) Require manual administrative repair.

Each of these has various drawbacks - automated conflict resolution may not
always be appropriate; implementing some of these is more difficult than
others; many slaves may need to be reconfigured; some set of changes may be
lost; manual repair may be tedious and prone to irregular human errors.
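
The five choices above can be sketched as a policy switch.  This is only
a model under stated assumptions: histories are plain Python lists, with
common being the changes both servers hold, old_only the old master's
unreplicated backlog, and new_only the changes the new master accepted
after promotion.  Policy, recover and the resolve callback are
hypothetical names, and real conflict resolution is far harder than the
one-line callback suggests.

```python
from enum import Enum


class Policy(Enum):
    ROLLBACK_OLD = 1     # choice 1: drop the old master's backlog
    MERGE_INTO_NEW = 2   # choice 2: new master stays master
    MERGE_INTO_OLD = 3   # choice 3: old master returns as master
    DISCARD_NEW = 4      # choice 4: drop the new master's changes
    MANUAL = 5           # choice 5: hand off to the administrator


def recover(policy, common, old_only, new_only, resolve=None):
    """Return (master, log); master is "old" or "new".

    resolve(base, incoming) is a caller-supplied conflict resolver
    that returns the merged list of changes.
    """
    if policy is Policy.ROLLBACK_OLD:
        return "new", common + new_only
    if policy is Policy.MERGE_INTO_NEW:
        return "new", common + resolve(new_only, old_only)
    if policy is Policy.MERGE_INTO_OLD:
        return "old", common + resolve(old_only, new_only)
    if policy is Policy.DISCARD_NEW:
        return "old", common + old_only
    raise RuntimeError("manual administrative repair required")


def append_all(base, incoming):
    # Trivial resolver: keep everything, incoming changes last.
    return base + incoming
```

For example, recover(Policy.MERGE_INTO_NEW, ["c1"], ["c2"], ["n1"],
append_all) keeps the new master in place with all three changes, while
Policy.ROLLBACK_OLD on the same input silently loses c2.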

You need to decide what tradeoffs are appropriate based on the application
you are using.  Since different applications may have different
requirements in this area, any interoperable protocol for automated
failover recovery should not preclude any of these choices.  Designing a
recovery protocol that encompasses all of these options is a rather large
problem, which is probably why it hasn't been formalized yet (at least not
in a public forum).

In any case, you are going to need to extend the HA software and the LDAP
servers to be intelligent enough to perform one of these actions.

Zachary Amsden

---- Original message ----
>Date: Mon, 14 Oct 2002 13:13:34 +0100
>From: "Hugo Tavares" <htavares@xxxxxxxxxxxxxx>  
>Subject: Single-Master Failover Recovery and Automated Replication  
>To: <ietf-ldup@xxxxxxx>
>
>
>Greetings
>
>I'm doing some research before implementing the LDAP services and I've noticed
>that there is a big question about synchronizing the information after recovery
>from a master failure in a single-master environment. As it's written in
>section 5.2 of "General Usage Profile for LDAPv3 Replication":
>
>"If the broken master is returned to service as a slave,
> then the administrator must, external to LDUP, distribute and resolve
> whatever pending changes remained undistributed and unresolved from the
> time immediately before it was removed from service. If the broken
> master is returned as a new master, then care must be taken with its
> replacement master to ensure that all of its pending changes are
> distributed and resolved before it is returned to duty as a slave."
>
>For failover recovery I will use the heartbeat protocol with linux-HA or
>the LVS software (I still don't know which to use), and this protocol seems
>to handle the services fine, since
>when the master goes down, heartbeat makes the slave the master for
>providing updates to the clients. But when the master comes back up,
>heartbeat restores the original configuration, although the
>problem of synchronization remains.
>
>My questions are:
>
>Is there any protocol which deals with this? (I don't think so)
>
>Do you have any plans for implementing an automated system to deal
>with this?
>
>
>Thank you
>
>Hugo Tavares
>

"A plague upon all your houses" - last words of Waldo Semon, inventor of vinyl