A replica set in ApsaraDB for MongoDB is a group of mongod processes that contains a primary node and multiple secondary nodes. MongoDB drivers write data to the primary node only. The data is then synchronized from the primary node to the secondary nodes. This ensures data consistency across all nodes in the replica set. Therefore, replica sets provide high availability.

The following figure is extracted from the official documentation of MongoDB. It shows a typical MongoDB replica set that contains one primary node and two secondary nodes.

Primary node election (1)

A replica set is initialized by running the replSetInitiate command or running the rs.initiate() command in the mongo shell. After the replica set is initialized, the members send heartbeat messages to each other and initiate the primary node election. The node that receives votes from a majority of the members becomes the primary node and the other nodes become secondary nodes.

Initialize the replica set

    config = {
        _id : "my_replica_set",
        members : [
             {_id : 0, host : "rs1.example.net:27017"},
             {_id : 1, host : "rs2.example.net:27017"},
             {_id : 2, host : "rs3.example.net:27017"},
        ]
    }
    rs.initiate(config)

Definition of majority

A group of voting members is considered a majority only if it contains more than half of the total number of voting members. If the number of surviving members in a replica set is equal to or less than half of all voting members, a primary node cannot be elected. In this case, you cannot write data to the replica set.

Number of voting members    Majority    Maximum number of failed nodes
1                           1           0
2                           2           0
3                           2           1
4                           3           1
5                           3           2
6                           4           2
7                           4           3

We recommend that you set the number of members in a replica set to an odd number. The preceding table shows that both a replica set with three nodes and a replica set with four nodes tolerate the failure of one node. The two replica sets provide the same service availability. However, the replica set with four nodes provides more reliable data storage.
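The Majority column in the preceding table follows a simple rule: a majority of n voting members is floor(n/2) + 1. The following is a minimal mongo shell (JavaScript) sketch that reproduces the table values:

    // A majority is more than half of the voting members: floor(n/2) + 1.
    function majority(votingMembers) {
        return Math.floor(votingMembers / 2) + 1;
    }
    majority(3)   // 2: the replica set tolerates the failure of 1 node
    majority(4)   // 3: the replica set still tolerates the failure of only 1 node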

Special secondary nodes

Secondary nodes of a replica set participate in the primary node election. A secondary node may also be elected as the primary node. The latest data written to the primary node is synchronized to secondary nodes to ensure data consistency across all nodes.

You can read data from secondary nodes. Therefore, you can add secondary nodes to a replica set to improve the read performance and service availability of the replica set. ApsaraDB for MongoDB allows you to configure secondary nodes of a replica set to meet the requirements of different scenarios. The following member types are supported; a configuration sketch follows this list.

  • Arbiter

    An arbiter node participates in the election as a voter only. It cannot be elected as the primary node or synchronize data from the primary node.

    Assume that you deploy a replica set that contains one primary node and one secondary node. If one node fails, the primary node election cannot be performed. As a result, the replica set becomes unavailable. In this case, you can add an arbiter node to the replica set to enable primary node election.

    An arbiter node is a lightweight node that does not store data. If the number of members in a replica set is an even number, we recommend that you add an arbiter node to increase the availability of the replica set.

  • Priority0

    A node that has priority 0 in the primary node election cannot be elected as the primary node.

    Assume that you deploy a replica set that contains nodes in both Data center A and Data center B. To ensure that the elected primary node is deployed in Data center A, set the priorities of the replica set members in Data center B to 0.

    Note If you set the priorities of the members in Data center B to 0, we recommend that you deploy a majority of replica set nodes in Data center A. Otherwise, the primary node election may fail during network partitioning.

  • Vote0

    In MongoDB 3.0, a replica set contains a maximum of 50 members, and up to seven members can vote in a primary node election. You must set the members[n].votes attribute to 0 for members that are not expected to vote.

  • Hidden

    A hidden member in a replica set cannot be elected as the primary node because its priority is 0. Hidden nodes are invisible to MongoDB drivers.

    You can use hidden nodes to back up data or perform offline computing tasks. This does not affect the services of the replica set because hidden nodes do not process requests from MongoDB drivers.

  • Delayed

    A delayed node must be a hidden node. Data on a delayed node reflects an earlier state of the data on the primary node. If you configure a one-hour delay, data on the delayed node is the same as the data on the primary node an hour earlier.

    Therefore, if you write incorrect or invalid data to the primary node, you can use the data on the delayed node to restore the data on the primary node to an earlier state.
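The following is a minimal mongo shell sketch that shows how these member types can be configured. The member indexes, hostnames, and the one-hour delay are placeholder assumptions; the slaveDelay field applies to MongoDB 3.x and is renamed in later versions.

    // Sketch only: assumes an initialized replica set with at least four data-bearing members.
    cfg = rs.conf()
    // Priority0: prevent the member in Data center B from becoming the primary node.
    cfg.members[1].priority = 0
    // Vote0: keep a data copy on this member but exclude it from elections.
    cfg.members[2].votes = 0
    cfg.members[2].priority = 0
    // Hidden and Delayed: hide this member from drivers and delay it by one hour.
    cfg.members[3].priority = 0
    cfg.members[3].hidden = true
    cfg.members[3].slaveDelay = 3600
    rs.reconfig(cfg)

    // Arbiter: added as a separate lightweight node that stores no data.
    rs.addArb("arb.example.net:27017")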

Primary node election (2)

A primary node election is triggered not only after replica set initialization but also in the following scenarios:

  • Replica set reconfiguration

    A primary node election is triggered if the primary node fails or voluntarily steps down and becomes a secondary node. A primary node election is affected by various factors, such as heartbeat messages among nodes, priorities of nodes, and the time when the last oplog entry was generated. The sketch after this list shows how to inspect member states and optimes.

    • Node priorities

      All nodes tend to vote for the node that has the highest priority. A priority 0 node cannot trigger primary node elections. If a secondary node has a higher priority than the primary node and the time difference between the latest oplog entry of the secondary node and that of the primary node is within 10 seconds, the primary node steps down. In this case, this secondary node becomes a candidate for the primary node.

    • Optime

      Only secondary nodes that have the latest oplog entry are eligible to be elected as the primary node.

  • Network partitioning

    Only nodes that are connected to a majority of voting nodes can be elected as the primary node. If the primary node is disconnected from a majority of the other nodes in the replica set, the primary node voluntarily steps down and becomes a secondary node. A replica set may have multiple primary nodes for a short period of time during network partitioning. When MongoDB drivers write data, we recommend that you set a policy that allows data to be written only to the primary node that is connected to a majority of nodes.
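A quick way to observe the outcome of an election is rs.status(), which reports the state of each member and the time of its last applied oplog entry. A minimal sketch:

    // Print each member's role and the time of its last applied oplog entry.
    rs.status().members.forEach(function (m) {
        print(m.name + "  state=" + m.stateStr + "  optime=" + m.optimeDate);
    });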

Data synchronization

Data is synchronized from the primary node to secondary nodes based on the oplog. After a write operation on the primary node is complete, an oplog entry is written to the special local.oplog.rs collection. Secondary nodes continuously pull new oplog entries from the primary node and apply the operations.

To prevent unlimited growth in the size of the oplog, local.oplog.rs is configured as a capped collection. When the amount of oplog data reaches the specified threshold, the earliest entries are deleted. All operations in the oplog must be idempotent. This ensures that an operation produces the same results regardless of whether it is applied to secondary nodes once or multiple times.
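You can inspect the oplog directly from the mongo shell. The following is a minimal sketch that prints the configured size of the capped collection and its most recent entry:

    var oplog = db.getSiblingDB("local").oplog.rs;
    // Configured maximum size of the capped collection, in bytes.
    oplog.stats().maxSize
    // Most recent entry: natural order matches insertion order in a capped collection.
    oplog.find().sort({ $natural: -1 }).limit(1).pretty()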

The following code block is a sample oplog entry, which contains fields such as ts, h, op, ns, and o:

    {
        "ts" : Timestamp(1446011584, 2),
        "h" : NumberLong("1687359108795812092"),
        "v" : 2,
        "op" : "i",
        "ns" : "test.nosql",
        "o" : { "_id" : ObjectId("563062c0b085733f34ab4129"), "name" : "mongodb", "score" : "100" }
    }
  • ts: the time when the operation was performed. The value contains two numbers. The first number is a UNIX timestamp. The second number is a counter that indicates the sequence number of each operation that occurs within a second. The counter is reset every second.
  • h: the unique identifier of the operation
  • v: the version of the oplog
  • op: the type of the operation. Valid values:
    • i: insert
    • u: update
    • d: delete
    • c: run commands such as createDatabase and dropDatabase
    • n: no-op. This value is used for special purposes.
  • ns: the collection on which the operation is performed
  • o: the details of the operation.
  • o2: the conditions of an update operation. This field is valid for update operations only.

During the initial synchronization, a secondary node performs an init sync to synchronize all data from the primary node or another secondary node that stores the latest data. Then, the secondary node continuously uses the tailable cursor feature to query the latest oplog entries in the local.oplog.rs collection of the primary node and applies the operations in these oplog entries.

The initial synchronization process includes the following steps:

  1. Before the time of T1, the data synchronization tool runs the listDatabases, listCollections, and cloneCollection commands. At the time of T1, all data in the databases (except the data in local.oplog.rs) starts to be synchronized from the primary node to the secondary node. Assume that the synchronization is complete at the time of T2.
  2. All the operations in oplog entries generated from T1 to T2 are applied to the secondary node. Operations in oplog entries are idempotent. Therefore, operations that have been applied in Step 1 can be reapplied.
  3. Based on the indexes of each collection on the primary node, indexes for the corresponding collections are created on the secondary node. (The indexes used here are the ones that each collection on the primary node had at the time of Step 1.)

    Note You must configure the oplog size based on the database size and the volume of data to be written by the application. If the oplog is oversized, storage space may be wasted. If the oplog size is too small, the secondary node may fail to complete the initial synchronization. For example, in Step 1, if the database stores a large amount of data and the oplog is not large enough, the oplog may fail to store all the oplog entries generated from T1 to T2. As a result, the secondary node cannot synchronize all the data sets from the primary node.
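The following is a minimal sketch for checking whether the current oplog window is large enough. Both helpers are run in the mongo shell:

    // Prints the configured oplog size and the time range covered by the current entries.
    rs.printReplicationInfo()
    // Prints how far each secondary node lags behind the primary node
    // (renamed rs.printSecondaryReplicationInfo() in later MongoDB versions).
    rs.printSlaveReplicationInfo()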

Modify a replica set

You can modify a replica set by running the replSetReconfig command or running the rs.reconfig() command in the mongo shell. For example, you can add or delete members, and change the priority, vote, hidden, and delayed attributes of a member.

For example, you can run the following command to set the priority of the second member in the replica set to 2:

    cfg = rs.conf();
    cfg.members[1].priority = 2;
    rs.reconfig(cfg);

Roll back operations on the primary node

Assume that the primary node of a replica set fails. If write operations have been performed on the new primary node when the former primary node rejoins the replica set, the former primary node must roll back operations that have not been synchronized to other nodes. This ensures data consistency between the former primary node and the new primary node.

The former primary node writes the rollback data to a dedicated directory. This allows the database administrator to run the mongorestore command to restore operations as needed.

Replica set read/write settings

  • Read Preference

    By default, all the read requests for a replica set are routed to the primary node. However, you can change the read preference modes on the drivers to route read requests to other nodes. A shell example appears at the end of this section.

    • primary: This is the default mode. All read requests are routed to the primary node.
    • primaryPreferred: Read requests are routed to the primary node preferentially. If the primary node is unavailable, read requests are routed to secondary nodes.
    • secondary: All read requests are routed to secondary nodes.
    • secondaryPreferred: Read requests are routed to secondary nodes preferentially. If all secondary nodes are unavailable, read requests are routed to the primary node.
    • nearest: Read requests are routed to the nearest reachable node, which can be detected by running the ping command.
  • Write Concern

    By default, the primary node returns a message that indicates a successful write operation after data is written to the primary node. You can set the write concern on drivers to specify the rule for a successful write operation. For more information, see Write concern.

    The following write concern indicates that a write operation is successful only after the data is written to a majority of nodes before the request times out. The timeout period is five seconds.

        db.products.insert(
            { item: "envelopes", qty : 100, type: "Clasp" },
            { writeConcern: { w: "majority", wtimeout: 5000 } }
        )

    The preceding settings apply to individual requests. You can also modify the default write concern of a replica set. The default write concern of a replica set applies to all the requests for the replica set.

        cfg = rs.conf()
        cfg.settings = {}
        cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
        rs.reconfig(cfg)
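To complement the preceding write concern settings, the following is a minimal sketch of how a read preference can be applied from the mongo shell. The secondaryPreferred mode and the hosts in the connection string are examples only:

    // Route reads from this shell connection to secondary nodes when possible.
    db.getMongo().setReadPref("secondaryPreferred")
    db.products.find({ type: "Clasp" })

    // Most drivers accept the same setting in the connection string, for example:
    // mongodb://rs1.example.net:27017,rs2.example.net:27017/?replicaSet=my_replica_set&readPreference=secondaryPreferred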