In the last Meetup (#Docker Bangalore), there has been lots of curiosity around "Desired State Reconciliation" & "Node Management" feature in case of Docker Engine 1.12 Swarm Mode. I found lots of queries post the presentation session on how Node Failure Handling is taken care in case of new Docker Swarm Mode , particularly when master node participating in the raft consensus goes down. Under this blog post, I will demonstrate how Master Node Failure is achieved which is very specific to RAFT consensus algorithm. We will look at how Swarmkit (the technical foundation of Swarm Mode implementation) uses the raft consensus algorithm and enables NO single point of failure feature to perform effective decision in the distributed system.
In the previous post we did a deep-dive into Swarm Mode implementation where we talked about the communication in between manager and worker nodes. Machines running SwarmKit can be grouped together in order to form a Swarm, coordinating tasks with each other. Once a machine joins, it becomes a Swarm Node. Nodes can either be worker nodes or manager nodes. Worker nodes are responsible for running Tasks while Manager nodes accept specifications from the user and are responsible for reconciling the desired state with the actual cluster state.
Manager nodes maintain a strongly consistent, replicated (Raft based) and extremely fast (in-memory reads) view of the cluster which allows them to make quick scheduling decisions while tolerating failures.Node roles (Worker or Manager) can be dynamically changed through API/CLI calls. Say, if any of master or worker node fails, SwarmKit reschedules its tasks(which are nothing but containers) onto a different node.
A Quick Brief on Raft Consensus Algorithm
Let's understand what raft consensus is all about. A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader. Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader. To understand the raft implementation, I recommend reading https://github.com/hashicorp/raft
PLEASE NOTE that there should always be an odd number of managers (1,3,5 or 7) to reach the consensus. If you have just two managers, with one manager down results in a situation where you can't achieve the consensus.Reason - greater than 50% of the managers need to "agree" to actually makes the raft consensus work.
Demonstrating Manager Node Failure
Let me demonstrate the master node failure scenario with the existing Swarm Mode cluster running on Google Cloud Engine. As shown below, I have 5 nodes forming Swarm Mode cluster installed running the experimental Docker 1.12.0-rc4 release.
The Swarm Mode cluster is already running a service which is replicated across 3 nodes - test-master1, test-node2 and test-node1 out of total 5 nodes. Let us use docker-machine(my all-time favorite) command to ssh to test-master1 and promote workers (test-node1 and test-node2) to the manager node as shown above.
Hence, the worker nodes are rightly promoted to manager node which is shown as "Reachable".
The "$docker ps" command shows that there is a task (container) already running on the master node. Please remember that "$docker ps" has to manually run on the dedicated node to know what local containers are running on the particular node.
The below picture depicts the detailed list of the containers(or tasks) which are distributed across the swarm cluster.
Let's bring down the manager node "test-master1" either by shutting it down uncleanly or stopping the instance through the available GCE feature.(as show below). The manager node(test-master1) is no longer reachable. If you try to ssh to test-node2 and check if the cluster is up and running, you will find that node failure has been taken care and desired state reconciliation comes into the picture. Now the 3-replicas of tasks or containers are running on test-node1, test-node2 and test-node3.
To implement raft consensus, there is a minimal recommendation of an odd number of managers (1,3,5 or 7). The maximum recommendation of manager node is 5 for better performance while increasing the manager nodes to 7 might incur performance bottleneck as there will be additional overhead in terms of communication to keep the mutual agreement in place between the managers.