Database replication - Scott Streit Content by wangping12


									Database replication is the creation and maintenance of multiple copies of the same database.
In most implementations of database replication, one database server maintains the master copy of the
database and additional database servers maintain slave copies of the database.
Database writes are sent to the master database server and are then replicated by the slave database
Database reads are divided among all of the database servers, which results in a large performance
advantage due to load sharing.
In addition, database replication can also improve availability because the slave database servers can be
configured to take over the master role if the master database server becomes unavailable.

Replication is the process of sharing information so as to ensure consistency between redundant
resources, such as software or hardware components, to improve reliability, fault-tolerance, or
accessibility. It could be data replication if the same data is stored on multiple storage devices, or
computation replication if the same computing task is executed many times. A computational task is
typically replicated in space, i.e. executed on separate devices, or it could be replicated in time, if it is
executed repeatedly on a single device.
The access to a replicated entity is typically uniform with access to a single, non-replicated entity. The
replication itself should be transparent to an external user. Also, in a failure scenario, a failover of
replicas is hidden as much as possible.
It is common to talk about active and passive replication in systems that replicate data or services.
Active replication is performed by processing the same request at every replica. In passive replication,
each single request is processed on a single replica and then its state is transferred to the other replicas.
If at any time one master replica is designated to process all the requests, then we are talking about the
primary-backup scheme (master-slave scheme) predominant in high-availability clusters. On the other
side, if any replica processes a request and then distributes a new state, then this is a multi-primary
scheme (called multi-master in database field). In the multi-primary scheme, some form of distributed
concurrency control must be used, such as distributed lock manager.
Load balancing is different from task replication, since it distributes a load of different (not the same)
computations across machines, and allows a single computation to be dropped in case of failure. Load
balancing, however, sometimes uses data replication (esp. multi-master) internally, to distribute its data
among machines.
Backup is different from replication, since it saves a copy of data unchanged for a long period of time.
Replicas on the other hand are frequently updated and quickly lose any historical state.

Replication in distributed systems
Replication is one of the oldest and most important topics in the overall area of distributed systems.
Whether one replicates data or computation, the objective is to have some group of processes that
handle incoming events. If we replicate data, these processes are passive and operate only to maintain
the stored data, reply to read requests, and apply updates. When we replicate computation, the usual
goal is to provide fault-tolerance. For example, a replicated service might be used to control a
telephone switch, with the objective of ensuring that even if the primary controller fails, the backup can
take over its functions. But the underlying needs are the same in both cases: by ensuring that the
replicas see the same events in equivalent orders, they stay in consistent states and hence any replica
can respond to queries.

[edit] Replication models in distributed systems
A number of widely cited models exist for data replication, each having its own properties and
    1. Transactional replication. This is the model for replicating transactional data, for example a
       database or some other form of transactional storage structure. The one-copy serializability
       model is employed in this case, which defines legal outcomes of a transaction on replicated data
       in accordance with the overall ACID properties that transactional systems seek to guarantee.
    2. State machine replication. This model assumes that replicated process is a deterministic finite
       state machine and that atomic broadcast of every event is possible. It is based on a distributed
       computing problem called distributed consensus and has a great deal in common with the
       transactional replication model. This is sometimes mistakenly used as synonym of active
    3. Virtual synchrony. This computational model is used when a group of processes cooperate to
       replicate in-memory data or to coordinate actions. The model defines a new distributed entity
       called a process group. A process can join a group, which is much like opening a file: the
       process is added to the group, but is also provided with a checkpoint containing the current state
       of the data replicated by group members. Processes can then send events (multicasts) to the
       group and will see incoming events in the identical order, even if events are sent concurrently.
       Membership changes are handled as a special kind of platform-generated event that delivers a
       new membership view to the processes in the group.
Levels of performance vary widely depending on the model selected. Transactional replication is
slowest, at least when one-copy serializability guarantees are desired (better performance can be
obtained when a database uses log-based replication, but at the cost of possible inconsistencies if a
failure causes part of the log to be lost). Virtual synchrony is the fastest of the three models, but the
handling of failures is less rigorous than in the transactional model. State machine replication lies
somewhere in between; the model is faster than transactions, but much slower than virtual synchrony.
The virtual synchrony model is popular[citation needed] in part because it allows the developer to use
either active or passive replication. In contrast, state machine replication and transactional replication
are highly constraining and are often embedded into products at layers where end-users would not be
able[citation needed] to access them.

[edit] Database replication
Database replication can be used on many database management systems, usually with a master/slave
relationship between the original and the copies. The master logs the updates, which then ripple
through to the slaves. The slave outputs a message stating that it has received the update successfully,
thus allowing the sending (and potentially re-sending until successfully applied) of subsequent updates.
Multi-master replication, where updates can be submitted to any database node, and then ripple through
to other servers, is often desired, but introduces substantially increased costs and complexity which
may make it impractical in some situations. The most common challenge that exists in multi-master
replication is transactional conflict prevention or resolution. Most synchronous or eager replication
solutions do conflict prevention, while asynchronous solutions have to do conflict resolution. For
instance, if a record is changed on two nodes simultaneously, an eager replication system would detect
the conflict before confirming the commit and abort one of the transactions. A lazy replication system
would allow both transactions to commit and run a conflict resolution during resynchronization. The
resolution of such a conflict may be based on a timestamp of the transaction, on the hierarchy of the
origin nodes or on much more complex logic, which decides consistently on all nodes.
Database replication becomes difficult when it scales up. Usually, the scale up goes with two
dimensions, horizontal and vertical: horizontal scale up has more data replicas, vertical scale up has
data replicas located further away in distance. Problems raised by horizontal scale up can be alleviated
by a multi-layer multi-view access protocol. Vertical scale up runs into less trouble when the internet
reliability and performance are improving.

[edit] Disk storage replication
Active (real-time) storage replication is usually implemented by distributing updates of a block device
to several physical hard disks. This way, any file system supported by the operating system can be
replicated without modification, as the file system code works on a level above the block device driver
layer. It is implemented either in hardware (in a disk array controller) or in software (in a device
The most basic method is disk mirroring, typical for locally-connected disks.
Notably, the storage industry narrows the definitions, so mirroring is a local (short-distance) operation.
A replication is extendable across a computer network, so the disks can be located in physically distant
locations. The purpose is to avoid damage done by, and improve availability in case of local failures or
disasters. Typically the above master-slave theoretical replication model is applied. The main
characteristic of such solutions is handling write operations:
       Synchronous replication - guarantees "zero data loss" by the means of atomic write operation,
        i.e. write either completes on both sides or not at all. Write is not considered complete until
        acknowledgement by both local and remote storage. Most applications wait for a write
        transaction to complete before proceeding with further work, hence overall performance
        decreases considerably. Inherently, performance drops proportionally to distance, as latency is
        caused by speed of light. For 10 km distance, the fastest possible roundtrip takes 67 μs, whereas
        nowadays a whole local cached write completes in about 10-20 μs.
              An often-overlooked aspect of synchronous replication is the fact, that failure of either
                remote replica or even just the interconnection stops by definition any and all writes
                (freezing the storage system). This is the behaviour that guarantees zero data loss.
                However, many commercial systems at such potentially dangerous point do not freeze,
                but just proceed with local writes, losing the desired zero recovery point objective.
       Asynchronous replication - write is considered complete as soon as local storage acknowledges
        it. Remote storage is updated, but probably with a small lag. Performance is greatly increased,
        but in case of losing a local storage, the remote storage is not guaranteed to have the current
        copy of data and most recent data may be lost.
              Semi-synchronous replication - introduced by some vendors, probably as a sales pitch;
                exact meaning varies.
              Point-in-time replication - introduces periodic snapshots that are replicated instead of
                primary storage.
Most important implementations:
       DRBD module for Linux.
       EMC SRDF
       IBM PPRC and Global Mirror
       Hitachi TrueCopy
       Symantec Veritas Volume Replicator (VVR)

[edit] Distributed shared memory replication
Another example of using replication appears in distributed shared memory systems, where it may
happen that many nodes of the system share the same page of the memory - which usually means, that
each node has a separate copy (replica) of this page.

[edit] Primary-backup and multi-primary replication
Many classical approaches to replication are based on a primary/backup model where one device or
process has unilateral control over one or more other processes or devices. For example, the primary
might perform some computation, streaming a log of updates to a backup (standby) process, which can
then take over if the primary fails. This approach the most common one for replicating databases,
despite the risk that if a portion of the log is lost during a failure, the backup might not be in a state
identical to the one the primary was in, and transactions could then be lost.
A weakness of primary/backup schemes is that in settings where both processes could have been active,
only one is actually performing operations. We're gaining fault-tolerance but spending twice as much
money to get this property. For this reason, starting in the period around 1985, the distributed systems
research community began to explore alternative methods of replicating data. An outgrowth of this
work was the emergence of schemes in which a group of replicas could cooperate, with each process
backup up the others, and each handling some share of the workload.
Jim Gray, a towering figure within the database community, analyzed multi-primary replication
schemes under the transactional model and ultimately published a widely cited paper skeptical of the
approach ("The Dangers of Replication and a Solution"). In a nutshell, he argued that unless data splits
in some natural way so that the database can be treated as n disjoint sub-databases, concurrency control
conflicts will result in seriously degraded performance and the group of replicas will probably slow
down as a function of n. Indeed, he suggests that the most common approaches are likely to result in
degradation that scales as O(n³). His solution, which is to partition the data, is only viable in situations
where data actually has a natural partitioning key.
The situation is not always so bleak. For example, in the 1985-1987 period, the virtual synchrony
model was proposed and emerged as a widely adopted standard (it was used in the Isis Toolkit, Horus,
Transis, Ensemble, Totem, Spread, C-Ensemble, Phoenix and Quicksilver systems, and is the basis for
the CORBA fault-tolerant computing standard; the model is also used in IBM Websphere to replicate
business logic and in Microsoft's Windows Server 2008 enterprise clustering technology). Virtual
synchrony permits a multi-primary approach in which a group of processes cooperate to parallelize
some aspects of request processing. The scheme can only be used for some forms of in-memory data,
but when feasible, provides linear speedups in the size of the group.
A number of modern products support similar schemes. For example, the Spread Toolkit supports this
same virtual synchrony model and can be used to implement a multi-primary replication scheme; it
would also be possible to use C-Ensemble or Quicksilver in this manner. WANdisco permits active
replication where every node on a network is an exact copy or replica and hence every node on the
network is active at at one time; this scheme is optimized for use in a wide area network.
Optimistic replication
From Wikipedia, the free encyclopedia

Jump to: navigation, search
Optimistic replication[1] (also known as lazy replication[2][3]) is a strategy for replication in which
replicas are allowed to diverge. Traditional pessimistic replication systems are based on the principle of
single-copy consistency. that is, users should observe the system to behave as if there was only one
copy of the data. Optimistic replication does away with this in favor of eventual consistency, meaning
that replicas are guaranteed to converge only when a system is idle.

[edit] Algorithms
An optimistic replication algorithm consists of five elements:
    1. Operation submission: Users submit operations at independent sites.
    2. Propagation: Each site shares the operations it knows about with the rest of the system.
    3. Scheduling: Each site decides on an order for the operations it knows about.
    4. Conflict resolution: If there are any conflicts among the operations a site has scheduled, it must
       modify them in some way.
    5. Commitment: The sites agree on a final schedule and conflict resolution result, and the
       operations are made permanent.
There are two strategies for propagation: state transfer, where sites propagate a representation of the
current state, and operation transfer, where sites propagate the operations that were performed
(essentially, a list of instructions on how to reach the new state).
Scheduling and conflict resolution can either be syntactic or semantic. Syntactic systems rely on
general information, such as when or where an operation was submitted. Semantic systems are able to
make use of application-specific information to make smarter decisions. Note that state transfer
systems generally have no information about the semantics of the data being transferred, and so they
have to use syntactic scheduling and conflict resolution.

[edit] Examples
One well-known example of a system based on optimistic replication is the CVS version control
system, or any other version control system which uses the copy-modify-merge paradigm. CVS covers
each of the five elements:
    1. Operation submission: Users edit local versions of files.
    2. Propagation: Users manually pull updates from a central server, or push changes out once the
       user feels they are ready.
    3. Scheduling: Operations are scheduled in the order that they are received by the central server.
    4. Conflict resolution: When a user pushes to or pulls from the central repository, any conflicts
       will be flagged for that user to fix manually.
    5. Commitment: Once the central server accepts the changes which a user pushes, they are
       permanently committed.
A special case of replication is synchronization, where there are only two replicas. For example,
personal digital assistants (PDAs) allow users to edit data either on the PDA or a computer, and then to
merge these two datasets together. Note, however, that replication is a broader problem than
synchronization, since there may be more than two replicas.

May 18, 2004
Database Replication in MySQL
By Ian Gilfillan

An introduction to replication
Recently, while having the knots pounded out of my body during a particularly painful shiatsu
lesson, I reflected on what put them there in the first place. Yes, 'the database' was once more to
blame. A busy database I work with saw one of its tables jump from 3GB to 7GB overnight, as we
imported archive data. As expected, this had some performance impact. Unfortunately, I had not
expected quite the knock, and it turned out that this single database server could no longer
handle the load. No matter how much more I tried to optimize the queries, tweak the
variables or bump up the query cache, it was not enough. The machine could not take any
more memory, and a hardware upgrade would do little good (at least with the kind of budget I
have to play with). However, MySQL does not claim to be enterprise-ready for nothing, and Yahoo
and other high-volume users of MySQL certainly do not run on one database server. There are a
number of techniques to handle high volumes, one of which I will introduce this month - MySQL
replication (I will look at others in future articles).
Replication allows you to take one database, make an exact copy of it on another server, and set
one of them (the slave) to take all its updates from the other (the master). The slave reads the
master's binary logs, which store all statements that change a database, and repeats these on
its database, keeping the two in exact sync. Since a replicating database simply repeats
statements, the databases are not necessarily exactly in sync, and advanced users can take
advantage of this. That is a topic for another article however, and we will look at simple replication
this month - getting one database to be an exact copy of another one.

What replication is not
       Replication is not a backup policy. A mistyped DELETE statement will be replicated on the
        slave too, and you could end up with two, perfectly synchronized, empty databases.
        Replication can help protect against hardware failure though.
       Replication is not an answer to all performance problems. Although updates on the slave
        are more optimized than if you ran the updates normally, if you use MyISAM tables, table-
        locking will still occur, and databases under high-load could still struggle.
       Replication is not a guarantee that the slave will be in sync with the master at any one
        point in time. Even assuming the connection is always up, a busy slave may not yet have
        caught up with the master, so you can't simply interchange SELECT queries across master
        and slave servers.

How to start replicating - the master server
       Grant the slave permission to replicate with the REPLICATION SLAVE privilege, for example
        as follows: GRANT REPLICATION SLAVE ON *.* TO slave_user IDENTIFIED BY
       If the master is not using the binary update log, add the following lines to the my.cnf or
        my.ini configuration file, and restart the server:

        By convention, the master is usually server-id 1, and any slaves from 2 onwards, though
        you can change this if you wish. If the master is already using the binary update log, either
        take note of the offset at the moment of the backup (the next step), or use the RESET
        MASTER statement to clear all binary logs and immediately begin the backup. You may
        want to make a copy of the binary logs before doing this, in case you need to use the
        binary logs to restore from backup.
       Make a backup of the database. You will use this to start the slave server. Note the
        comments about the binary log above. You can also skip this step if you use the LOAD
        DATA FROM MASTER statement, but see the comments about locking the master below
       Add the following to the configuration file on the slave:

        The slave user and slave password are those to which you set when you granted
        REPLICATION SLAVE permission on the master. The server-id must be a unique number,
        different to the master or any other slaves in the system. There are also two other options:
        master-port, used if the master is running on a non-standard port (3306 is default), and
        master-connect-retry, a time in seconds for the slave to attempt to reconnect if the master
        goes down. 60 seconds is default.
       Restore the data from the master, either as you would normally restore a backup, or with
        the statement LOAD DATA FROM MASTER. The latter will lock the master for the duration
        of the operation, which could be quite lengthy, so you may not be able to spare the

Replication in action
Once the slave has started, replication should begin. Besides the obvious SELECT queries, you can
make sure this is working correctly with the following statements:
*************************** 1. row ***************************
          Master_User: slave_server
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: master-bin.054
  Read_Master_Log_Pos: 16664104
       Relay_Log_File: slave-relay-bin.045
        Relay_Log_Pos: 17657643
Relay_Master_Log_File: master-bin.054
     Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
      Replicate_do_db: vne
           Last_errno: 0
         Skip_counter: 0
  Exec_master_log_pos: 16664104
     Relay_log_space: 17657643

This is a mature slave that has been running a while. The master is already on the 54th binary
log. You can see if the slave is running correctly by looking at the Slave_IO_Running and
Slave_SQL_Running. The most important field is the Last_error field.
| File       | Position | Binlog_do_db | Binlog_ignore_db |
| master-bin.054 | 16664104 |              |             |

The above is from a master that has been running a while. It is already on binlog 54.

Starting to replicate from a particular point in the binary logs
If you need to force the slave to begin at a certain point, usually when the master has been
running with an active binary log, you can do so as follows. The following starts with the 3rd
binary log, as position 420. You can find the position using mysqlbinlog.
Query OK, 0 rows affected (0.00 sec)

The SLAVE START and SLAVE STOP commands are used to manually stop and start the slave. The
slave will also always stop if it comes across an error while replicating.

Removing old binary logs
On active databases, the binary logs tend to grow quite quickly. You may have used RESET
MASTER in the past to clear them, but you cannot do this to the master while replicating! The
statement to use is PURGE MASTER LOGS. First, make sure that all slaves have replicated to at
least the log beyond which you want to remove. For example, in our earlier example, both the
slave and the master are on log master-bin.054, so we can safely remove master-log.053 and
before, as follows:
mysql> PURGE MASTER LOGS TO 'master-bin.053';

MySQL will not allow you to remove a log that the master is still using though.

Replicating specific databases only
As mentioned earlier, the master and slave database server do not need to be entirely in sync,
with all databases and all tables from the master replicated onto the slave. By default, the slave
will replicate everything, but you can change this behavior with the following options in the slave
configuration file:
replicate-do-db=db_name (replicate this database)
replicate-ignore-db=db_name (don't replicate this database)
replicate-do-table=db_name.table_name (replicate this table)
replicate-ignore-table=db_name.table_name (don't replicate this table)
           (allows wildcards, e.g db% would be all databases beginning with db)
replicate-wild-ignore-table=db_name.table_name (ignore all specified tables, with wildcards)

These options can all be used multiple times in a single configuration. A couple of other useful
replicate-rewrite-db=master_db->slave_db (allows you to use map databases
           that use different database names on each server)
log-slave-update (writes replicated statements to the slaves binary logs)

A few complexities
       Before shutting down a slave server (mysqladmin-shutdown, not STOP SLAVE, make sure it
        has no temporary tables open (these may be needed for a statement to be replicated). You
        can do this with SHOW STATUS to see the value of Slave_open_temp_tables. This
        annoying feature should be fixed soon, so please check the latest documentation.
       The USER(), UUID(), LOAD_FILE() and CONNECTION_ID() (before MySQL 4.1.1) functions
        do not work reliably on the slave (they are replicated without changes).
       Before MySQL 4.1.1, FLUSH, ANALYZE, OPTIMIZE, and REPAIR statements are not
        replicated. This means that if you change permissions on the master by editing the tables
        directly, you will need to manually FLUSH PRIVILEGES on the slave too.
       Make sure the slave and the master are using the same character set.

Multi-master replication is a method of replication employed by databases to transfer data or changes
to data across multiple computers within a group. Multi-master replication can be contrasted with a
master-slave method (also known as single-master replication).
The term Multi-master can also be applied to systems in general where a single piece of information
can be updated by one of several systems. That is, no one system can be said to own the information
and be able to control it consistency and accuracy.
Benefits of Multi-master replication :
       If one master fails, other masters will continue to update the database.
       Masters can be located in several physical sites i.e. distributed across the network.
Disadvantages of Multi-master replication :
       Most multi-master replication systems are only loosely consistent, i.e. lazy and asynchronous,
        violating ACID properties.
       Eager replication systems are complex and introduce some communication latency.
       Issues such as conflict resolution can become intractable as the number of nodes involved rises
        and the required latency decreases
[edit] Log-Based
A database transaction log is referenced to capture changes made to the database. For log-based
transaction capturing, database changes can only be distributed asynchronously.

[edit] Trigger-Based
Triggers at the subscriber capture changes made to the database and submit them to the publisher. With
trigger-based transaction capturing, database changes can be distributed either synchronously or

[edit] Implementations
Many directory servers based on LDAP implement multi-master replication. OpenDS implements
"synchronization" instead.

[edit] Active Directory
For example, one of the more prevalent of multi-master replication implementations in directory
servers is Microsoft's Active Directory. Within Active Directory, objects that are updated on one
Domain Controller are then replicated to other domain controllers through multi-master replication. It
is not required for all domain controllers to replicate with each other's domain controller as this would
cause excessive network traffic in large Active Directory implementations. Instead, domain controllers
have a complex update pattern that ensures that all servers are updated in a timely fashion without
excessive replication traffic. Some Active Directory needs are better served by Flexible single master

[edit] Oracle
Oracle database clusters implement multi-master replication using one of two methods. Asynchronous
multi-master replication commits data changes to a deferred transaction queue which is periodically
processed on all databases in the cluster. Synchronous multi-master replication uses Oracle's two phase
commit functionality to ensure that all databases with the cluster have a consistent dataset.

[edit] MySQL
MySQL ships with replication support. Using features introduced in MySQL 5.0, it is possible to
achieve a multi-master replication scheme.[1]

[edit] PostgreSQL
PostgreSQL offers multiple solutions for multi-master replication, including solutions based on two
phase commit. There's Bucardo, PgPool and PgPool-II, PgCluster and Sequoia as well as some
proprietary solutions. Another promising approach, implementing eager but asynchronous replication is
Postgres-R. It's still in development.
[edit] Ingres
Ingres Within Ingres Replicator, objects that are updated on one Ingres server can then replicated to
other servers whether local or remote through multi-master replication. If one server fails, client
connections can be re-directed to another server. It is not required for all Ingres servers in an
environment to replicate with each other as this could cause excessive network traffic in large
implementations. Instead, Ingres Replicator provides an elegant and sophisticated design that allows
the appropriate data to be replicated to the appropriate servers without excessive replication traffic.
This means that some servers in the environment can serve as failover candidates while other servers
can meet other requirements such as managing a subset of columns or tables for a departmental
solution, a subset of rows for a geographical region or one-way replication for a reporting server. In the
event of a source, target, or network failure, data integrity is enforced through this two-phase commit
protocol by ensuring that either the whole transaction is replicated, or none of it is. In addition, Ingres
Replicator can operate over RDBMS’s from multiple vendors to connect them.

[edit] DBReplicator
DBReplicator is a powerful open-source Java application for network-based multi-master
heterogeneous database replication or filtered synchronization. Supported databases include:
CloudScape / Apache Derby / JavaDB, DaffodilDB / One$DB, IBM DB2, Firebird (database server),
MySQL, Oracle Database, PointBase, PostgreSQL, SQLServer, Sybase.

To top