Ajeet Raina Ajeet Singh Raina is a former Docker Captain, Community Leader and Distinguished Arm Ambassador. He is a founder of Collabnix blogging site and has authored more than 700+ blogs on Docker, Kubernetes and Cloud-Native Technology. He runs a community Slack of 9800+ members and discord server close to 2600+ members. You can follow him on Twitter(@ajeetsraina).

New Docker Engine 1.10 brings Enterprise-level security hardening – Secure Computing


Security issues such as kernel exploits, denial-of-service attacks, container breakouts, poisoned images and compromised secrets have been discussed a lot recently in the context of container-based environments. With Docker’s “Build, Ship and Run” mission, the introduction of a standardized image format has fueled an explosion of interest in containers across enterprise IT. Docker allows greater sharing of resources, and with dozens of applications running on the same system, the chances of exposing an application to the outside world, and with it the risk of application vulnerabilities, are sure to increase.

How to use Docker securely?

To use Docker safely, you need to be aware of the potential security issues and the major tools and techniques for securing container-based systems. The new Docker 1.10 release focuses on the security aspect of container technology. It brings several security improvements that merit the attention of Docker developers and operations teams. In this article, we are going to look at one of the new security features announced as part of the Docker 1.10 release, aptly called “secure computing”.

In 2005, Andrea Arcangeli, a contributor to Linux memory management, started exploring what we today call the “secure computing” (or “seccomp”) feature. He initially set out to enable owners of Linux systems to rent out their CPUs to people doing serious processing work. Allowing strangers to run arbitrary code is something that people tend to be nervous about; they require some pretty strong assurance that this code will not have general access to their systems. Seccomp solves this problem by putting a strict sandbox around processes running code from others. A process running in seccomp mode is severely limited in what it can do; only four system calls (read(), write(), exit(), and sigreturn()) are available to it. Any attempt to make another system call results in immediate termination of the process. The idea is that a control process could obtain the code to be run and load it into memory. After setting up its file descriptors appropriately, this process would call:

prctl(PR_SET_SECCOMP, 1);

to enable seccomp mode. It could then jump into the guest code, knowing that no real harm could be done. The guest code can run on the CPU and communicate over the file descriptors given to it, but it has no other access to the system. Later on, the Google team thought that seccomp would make a good platform on which to create a “finished implementation” for Linux, and they used it for Google Chrome.

Secure computing mode (seccomp) is purely a Linux kernel feature that lets you control which system calls a process may make, and this applies to containers as well when it is applied via Docker. You can use it to restrict the actions available within a container. The seccomp() system call operates on the seccomp state of the calling process. In addition, Docker ships a default seccomp profile that is now part of the Docker daemon.
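To see which seccomp state a process is currently in, one option is to read the Seccomp field that Linux exposes in /proc/<pid>/status (0 means disabled, 1 strict mode, 2 filter mode). The sketch below is only an illustration: the helper names parse_seccomp_mode and current_seccomp_mode are made up for this post, and it assumes a kernel new enough to report that field.

```python
from pathlib import Path
from typing import Optional

# Meaning of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp_mode(status_text: str) -> Optional[str]:
    """Return the seccomp mode named in a /proc/<pid>/status dump, or None."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" (newer kernels) does not match this prefix.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return None  # field absent: old kernel or seccomp not built in


def current_seccomp_mode() -> Optional[str]:
    """Report the seccomp mode of the calling process (Linux only)."""
    status = Path("/proc/self/status")
    if not status.exists():
        return None
    return parse_seccomp_mode(status.read_text())


if __name__ == "__main__":
    print("seccomp mode:", current_seccomp_mode())
```

Run inside a container started with the default profile, this would be expected to report filter mode; on a plain host shell it usually reports disabled.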

By default, the new Docker seccomp profile disables around 44 system calls out of the 313 available on 64-bit Linux systems. These system calls are listed on the official Docker site (reproduced below for reference).

Syscall Description
acct Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key Prevent containers from using the kernel keyring, which is not namespaced.
adjtimex Similar to clock_settime and settimeofday, time/date is not namespaced.
bpf Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN.
clock_adjtime Time/date is not namespaced.
clock_settime Time/date is not namespaced.
clone Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
create_module Deny manipulation and functions on kernel modules.
delete_module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
finit_module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
get_kernel_syms Deny retrieval of exported kernel and module symbols.
get_mempolicy Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
init_module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
ioperm Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
iopl Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
kcmp Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
kexec_file_load Sister syscall of kexec_load that does the same thing, slightly different arguments.
kexec_load Deny loading a new kernel for later execution.
keyctl Prevent containers from using the kernel keyring, which is not namespaced.
lookup_dcookie Tracing/profiling syscall, which could leak a lot of information on the host.
mbind Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
modify_ldt Old syscall only used in 16-bit code and a potential information leak.
mount Deny mounting, already gated by CAP_SYS_ADMIN.
move_pages Syscall that modifies kernel memory and NUMA settings.
name_to_handle_at Sister syscall to open_by_handle_at. Already gated by CAP_SYS_NICE.
nfsservctl Deny interaction with the kernel nfs daemon.
open_by_handle_at Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH.
perf_event_open Tracing/profiling syscall, which could leak a lot of information on the host.
personality Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns.
pivot_root Deny pivot_root, should be privileged operation.
process_vm_readv Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
process_vm_writev Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
ptrace Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_SYS_PTRACE.
query_module Deny manipulation and functions on kernel modules.
quotactl Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN.
reboot Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT.
restart_syscall Don’t allow containers to restart a syscall. Possible seccomp bypass see: https://code.google.com/p/chromium/issues/detail?id=408827.
request_key Prevent containers from using the kernel keyring, which is not namespaced.
set_mempolicy Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
setns Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN.
settimeofday Time/date is not namespaced. Also gated by CAP_SYS_TIME.
stime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
swapon Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
swapoff Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
sysfs Obsolete syscall.
_sysctl Obsolete, replaced by /proc/sys.
umount Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
umount2 Should be a privileged operation.
unshare Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user.
uselib Older syscall related to shared libraries, unused for a long time.
ustat Obsolete syscall.
vm86 In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
vm86old In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.

 

Source ~ https://docs.docker.com/engine/security/seccomp/

It’s time to look at the seccomp implementation. By default, Docker 1.10 as packaged in DEB or RPM form doesn’t support the seccomp feature on older distributions.

Please Note: Seccomp profiles require seccomp 2.2.1 and are only available starting with Debian 9 “Stretch”, Ubuntu 15.10 “Wily”, and Fedora 22. To use this feature on Ubuntu 14.04, Debian Wheezy, or Debian Jessie, you must download the latest static Docker Linux binary.

I am using Ubuntu 14.04.3 for my Docker containers, so I will download the static Docker Linux binary linked from https://docs.docker.com/engine/installation/binaries/ and run it to get seccomp working.

root@dell:~# wget https://get.docker.com/builds/Linux/x86_64/docker-1.10.2
--2016-02-28 05:33:15--  https://get.docker.com/builds/Linux/x86_64/docker-1.10.2
Connecting to 10.116.2.242:80... connected.
Proxy request sent, awaiting response... 200 OK
Length: 34892323 (33M) [binary/octet-stream]
Saving to: ‘docker-1.10.2.1’

100%[==============================================>] 3,48,92,323  972KB/s   in 31s

2016-02-28 05:33:46 (1.06 MB/s) - ‘docker-1.10.2.1’ saved [34892323/34892323]

root@dell:~# chmod +x docker-1.10.2.1

root@dell:~# ./docker-1.10.2.1 daemon
INFO[0000] [graphdriver] using prior storage driver "aufs"
INFO[0000] Graph migration to content-addressability took 0.00 seconds
INFO[0000] Firewalld running: false
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address
..

INFO[0001] Loading containers: done.
INFO[0001] Daemon has completed initialization
INFO[0001] Docker daemon   commit=c3959b1 execdriver=native-0.2 graphdriver=aufs version=1.10.2
INFO[0001] API listen on /var/run/docker.sock

Ensure that the Docker service is running fine:

root@dell:~# docker version
Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 22:37:33 2016
 OS/Arch:      linux/amd64
root@dell:~#

Passing unconfined to run a container without the default seccomp profile

Docker’s default seccomp profile is a whitelist which specifies the calls that are allowed. Passing unconfined to --security-opt runs the container without that profile.

Let’s take a simple example:

root@dell:~# docker run --rm -it --security-opt seccomp:unconfined hello-world

Hello from Docker.
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the “hello-world” image from the Docker Hub.
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker Hub account:
https://hub.docker.com

For more examples and ideas, visit:
https://docs.docker.com/userguide/
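When containers are launched from scripts rather than typed by hand, the same invocation can be assembled as an argument list. The helper below is a hypothetical sketch (docker_run_with_seccomp is not part of any Docker tooling); it only builds the argv and does not start a container, and it assumes the seccomp:<value> spelling used by the Docker 1.10 commands in this post.

```python
from typing import List, Sequence


def docker_run_with_seccomp(image: str,
                            profile: str = "unconfined",
                            command: Sequence[str] = ()) -> List[str]:
    """Build (but do not execute) a `docker run` argv with a seccomp option.

    `profile` is either the literal "unconfined" or the path to a JSON
    profile file. The "seccomp:<value>" form mirrors the commands shown
    in this post for Docker 1.10.
    """
    return ["docker", "run", "--rm", "-it",
            "--security-opt", f"seccomp:{profile}",
            image, *command]


if __name__ == "__main__":
    print(" ".join(docker_run_with_seccomp("hello-world")))
    print(" ".join(docker_run_with_seccomp("busybox", "mypolicy.json", ["sh"])))
```

The resulting list could be handed to subprocess.run on a host where the Docker client is installed.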

Passing a profile to run a container with a custom seccomp profile

Let’s pick two syscalls, “mount” and “chown”, for this example.

Create a JSON file with the contents below and save it as mypolicy.json.

{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "name": "mount",
            "action": "SCMP_ACT_ERRNO"
        },
        {
            "name": "chown",
            "action": "SCMP_ACT_ERRNO"
        }
    ]
}

Make sure the format matches exactly, since JSON is strict about syntax: use plain double quotes and no trailing commas.
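One way to avoid hand-editing mistakes is to generate the profile and check it with a JSON parser before handing it to Docker. This is a minimal sketch using only Python’s standard json module; build_profile and validate_profile_text are illustrative names, and the checks cover only the two keys used in the profile above, not the full profile format.

```python
import json


def build_profile(denied_syscalls):
    """Return a profile dict that allows everything except the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"name": name, "action": "SCMP_ACT_ERRNO"}
            for name in denied_syscalls
        ],
    }


def validate_profile_text(text):
    """Parse profile text and check the fields this post relies on.

    Raises ValueError for malformed JSON (for example typographic quotes
    pasted from a web page) or for missing keys.
    """
    profile = json.loads(text)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(profile, dict) or "defaultAction" not in profile:
        raise ValueError("profile must be an object with a defaultAction")
    for entry in profile.get("syscalls", []):
        if "name" not in entry or "action" not in entry:
            raise ValueError(f"incomplete syscall entry: {entry!r}")
    return profile


if __name__ == "__main__":
    text = json.dumps(build_profile(["mount", "chown"]), indent=4)
    validate_profile_text(text)
    print(text)  # redirect this output to mypolicy.json
```

Because json.dumps always emits plain double quotes, the printed text can be saved as mypolicy.json without the quoting problems that creep in when copying from a browser.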

Run the command below to apply the seccomp profile to a container:
root@dell-PowerEdge-R630:~# docker run -it --security-opt seccomp:mypolicy.json busybox sh
/ # ls
bin   dev   etc   home  proc  root  sys   tmp   usr   var
/ # mkdir test
/ # chown root:root test
chown: test: Operation not permitted
/ #

The first command passes the profile to the container through the --security-opt seccomp option. Once inside the container, the user cannot change the ownership of a file, because the chown syscall is denied. Similarly, the mount command is restricted.
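The behaviour seen in the session can be reasoned about with a small model of how such a profile is read: a syscall listed in the profile gets its own action, and everything else falls back to defaultAction. The code below is a simplified illustration only (seccomp_action is a made-up helper, and the real filtering happens in the kernel and can also match on arguments); the second profile shows the whitelist style that Docker’s default profile uses, with a hypothetical two-entry allow list.

```python
def seccomp_action(profile, syscall):
    """Return the action a profile assigns to a syscall name."""
    for entry in profile.get("syscalls", []):
        if entry["name"] == syscall:
            return entry["action"]
    return profile["defaultAction"]  # nothing matched: use the default


# The profile from this post: allow everything, deny mount and chown.
DENY_LIST_PROFILE = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {"name": "mount", "action": "SCMP_ACT_ERRNO"},
        {"name": "chown", "action": "SCMP_ACT_ERRNO"},
    ],
}

# Whitelist style: deny everything, allow only what is listed.
# The two entries are placeholders, far fewer than a real profile needs.
ALLOW_LIST_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"name": "read", "action": "SCMP_ACT_ALLOW"},
        {"name": "write", "action": "SCMP_ACT_ALLOW"},
    ],
}

if __name__ == "__main__":
    for name in ("mkdir", "chown", "mount"):
        print(name, "->", seccomp_action(DENY_LIST_PROFILE, name))
```

In this model mkdir falls through to SCMP_ACT_ALLOW while chown and mount resolve to SCMP_ACT_ERRNO, which matches the “Operation not permitted” error in the transcript.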

In summary, “secure computing” provides the flexibility to harden specific containers based on their actual needs. That said, it still requires in-depth knowledge of system calls and container behaviour. It gives developers and Docker enthusiasts another weapon for making Docker’s “Build, Ship and Run” mission more secure and production-ready.

 

 

 

