Understanding Node Failure Handling under Docker 1.12 Swarm Mode

In the last Meetup (#Docker Bangalore), there was a lot of curiosity around the “Desired State Reconciliation” and “Node Management” features of Docker Engine 1.12 Swarm Mode. After the presentation session, I received many queries about how node failure handling works in the new Docker Swarm Mode, particularly when a manager node participating in the Raft consensus goes down. In this blog post, I will demonstrate how manager node failure is handled, which comes down to the Raft consensus algorithm. We will look at how SwarmKit (the technical foundation of the Swarm Mode implementation) uses Raft to eliminate any single point of failure and make effective decisions in a distributed system.

In the previous post, we did a deep dive into the Swarm Mode implementation and talked about the communication between manager and worker nodes. Machines running SwarmKit can be grouped together in order to form a Swarm, coordinating tasks with each other. Once a machine joins, it becomes a Swarm Node. Nodes can either be worker nodes or manager nodes. Worker nodes are responsible for running Tasks, while Manager nodes accept specifications from the user and are responsible for reconciling the desired state with the actual cluster state.
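To make this concrete, here is a minimal sketch of how a swarm is formed with the 1.12 GA CLI (the early RCs differed slightly in flags and join flow; the token and IP address below are placeholders):

    # On the first machine – initialize the swarm; this node becomes a manager
    $ docker swarm init

    # On every other machine – join the swarm using the token printed by init
    $ docker swarm join --token <WORKER-TOKEN> <MANAGER-IP>:2377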


Manager nodes maintain a strongly consistent, replicated (Raft-based) and extremely fast (in-memory reads) view of the cluster, which allows them to make quick scheduling decisions while tolerating failures. Node roles (worker or manager) can be dynamically changed through API/CLI calls. If any node fails, SwarmKit reschedules its tasks (which are nothing but containers) onto different nodes.
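For example, a replicated service like the 3-replica one used later in this post could be created as follows (the service name web and the nginx image are assumptions for illustration); the managers then schedule the 3 tasks across the available nodes and keep rescheduling them if nodes fail:

    # Create a replicated service – the managers schedule 3 tasks across the cluster
    $ docker service create --name web --replicas 3 nginx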

A Quick Brief on the Raft Consensus Algorithm

Let’s understand what Raft consensus is all about. A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader.

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in the follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, it assumes there is no viable leader and begins an election to choose a new one. To understand a working Raft implementation, I recommend reading https://github.com/hashicorp/raft


PLEASE NOTE that there should always be an odd number of managers (1, 3, 5 or 7) to reach consensus. If you have just two managers, losing one of them results in a situation where consensus can’t be achieved. The reason: more than 50% of the managers need to “agree” for Raft consensus to make progress.
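The arithmetic behind this rule: a quorum is floor(N/2) + 1 managers, so an N-manager cluster tolerates floor((N-1)/2) failures:

    Managers   Quorum   Failures tolerated
    1          1        0
    2          2        0
    3          2        1
    5          3        2
    7          4        3

Notice that 2 managers tolerate zero failures, which is exactly why an even manager count buys you nothing over the next lower odd count.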

Demonstrating Manager Node Failure

Let me demonstrate the manager node failure scenario with an existing Swarm Mode cluster running on Google Compute Engine. As shown below, I have 5 nodes forming a Swarm Mode cluster, running the experimental Docker 1.12.0-rc4 release.
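For reference, here is a sketch of what the node listing looks like at this point (the IDs are placeholders, and the fifth hostname is illustrative, since only four node names appear later in this post):

    $ docker node ls
    ID           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
    1x2abc... *  test-master1  Ready   Active        Leader
    3y4def...    test-node1    Ready   Active
    5z6ghi...    test-node2    Ready   Active
    7w8jkl...    test-node3    Ready   Active
    9v0mno...    test-node4    Ready   Active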

 


The Swarm Mode cluster is already running a service replicated across 3 of the 5 nodes – test-master1, test-node1 and test-node2. Let us use the docker-machine command (my all-time favorite) to SSH into test-master1 and promote the workers (test-node1 and test-node2) to managers, as shown below.
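A sketch of the promotion steps (the confirmation messages are approximate):

    $ docker-machine ssh test-master1
    $ docker node promote test-node1
    Node test-node1 promoted to a manager in the swarm.
    $ docker node promote test-node2
    Node test-node2 promoted to a manager in the swarm.

    # Both nodes now appear as additional managers
    $ docker node ls
    ID           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
    1x2abc... *  test-master1  Ready   Active        Leader
    3y4def...    test-node1    Ready   Active        Reachable
    5z6ghi...    test-node2    Ready   Active        Reachable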


Hence, the worker nodes have been promoted to managers, which docker node ls reports as “Reachable” (while the current leader is reported as “Leader”).

The “docker ps” command shows that a task (container) is already running on the master node. Please remember that “docker ps” has to be run manually on each node, since it only lists the local containers running on that particular node.
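For example (output trimmed; the container name follows Swarm’s service.slot.task-id naming scheme, with the assumed service name web):

    # Run on test-master1 – lists only the containers local to this node
    $ docker-machine ssh test-master1 docker ps
    CONTAINER ID   IMAGE          COMMAND                  STATUS          NAMES
    a1b2c3d4e5f6   nginx:latest   "nginx -g 'daemon..."    Up 10 minutes   web.1.7ab3...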


Below is the cluster-wide view of the containers (or tasks) distributed across the swarm cluster.
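A hedged sketch of that view, again assuming the service is named web (docker service ps is the subcommand on current Docker releases; the very early 1.12 RCs named it differently):

    $ docker service ps web
    ID      NAME   IMAGE  NODE          DESIRED STATE  CURRENT STATE
    7ab...  web.1  nginx  test-master1  Running        Running 12 minutes ago
    8cd...  web.2  nginx  test-node1    Running        Running 12 minutes ago
    9ef...  web.3  nginx  test-node2    Running        Running 12 minutes ago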


Let’s bring down the manager node “test-master1”, either by shutting it down uncleanly or by stopping the instance through the GCE console. The manager node (test-master1) is no longer reachable. If you SSH into test-node2 and check whether the cluster is up and running, you will find that the node failure has been taken care of and desired state reconciliation has come into the picture: the 3 replicas (tasks, i.e. containers) are now running on test-node1, test-node2 and test-node3, as sketched below.
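A sketch of that verification from the surviving manager (same assumed service name web; note that a new leader has been elected and the replica from test-master1 has been rescheduled onto test-node3):

    $ docker-machine ssh test-node2
    $ docker node ls
    ID           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
    1x2abc...    test-master1  Down    Active        Unreachable
    3y4def...    test-node1    Ready   Active        Reachable
    5z6ghi... *  test-node2    Ready   Active        Leader
    ...

    $ docker service ps web
    ID      NAME   IMAGE  NODE        DESIRED STATE  CURRENT STATE
    8cd...  web.2  nginx  test-node1  Running        Running 25 minutes ago
    9ef...  web.3  nginx  test-node2  Running        Running 25 minutes ago
    2gh...  web.1  nginx  test-node3  Running        Running 2 minutes ago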


To implement Raft consensus, an odd number of managers (1, 3, 5 or 7) is the minimal recommendation. Five managers is the recommended maximum for good performance; increasing the manager count to 7 can introduce a performance bottleneck, since there is additional communication overhead to keep the mutual agreement in place between the managers.

Q&A from the Comments


C. Mitchell asks: Great post! I have a question regarding joining new nodes to the cluster. When Swarm Mode is used to initialize a swarm, tokens are generated. The tokens can be used with the `swarm join` command to add a worker or a manager to the cluster. Does this continue to work if the initial manager is no longer available? In other words, are the tokens still valid on the other managers?

Answer: Thanks for reading the blog. Yes – the join tokens are part of the Raft-replicated cluster state, so they remain valid on the other managers. Even if the initial manager is no longer available, new nodes can still join and the cluster keeps working, as long as a quorum of the remaining managers is up. This is another reason to run at least 3 managers, so the swarm survives the loss of any one manager.
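For illustration, a hedged sketch of retrieving the join token from a surviving manager (using the current CLI; the early 1.12 RCs handled tokens slightly differently):

    $ docker-machine ssh test-node2
    $ docker swarm join-token worker
    To add a worker to this swarm, run the following command:

        docker swarm join --token SWMTKN-1-<token> <MANAGER-IP>:2377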
