Security issues such as kernel exploits, denial-of-service attacks, container breakouts, poisoned images, and compromised secrets have been discussed a lot recently for container-based environments. With Docker’s “Build, Ship and Run” mission, the introduction of a standardized image format has fueled an explosion of interest in containers in enterprise IT. Docker allows greater sharing of resources, and when dozens of applications run on the same system, the chances of exposing an application to the outside world, and with it the risk from application vulnerabilities, are sure to increase.
How to use Docker securely?
To use Docker safely, you need to be aware of the potential security issues and the major tools and techniques for securing container-based systems. The new Docker 1.10 release focuses on the security aspect of container technology. It brings several security improvements that merit the attention of Docker developers and operations teams. In this article, we are going to look at one of the new security features announced as part of the Docker 1.10 release, aptly called “secure computing”.
In 2005, Andrea Arcangeli, a contributor to the kernel’s memory management code, started exploring what we today call the “secure computing” (or “seccomp”) feature. Initially, he was working on a way for owners of Linux systems to rent out their CPUs to people doing serious processing work. Allowing strangers to run arbitrary code is something that people tend to be nervous about; they require some pretty strong assurance that this code will not have general access to their systems. Seccomp solves this problem by putting a strict sandbox around processes running code from others. A process running in seccomp mode is severely limited in what it can do; only four system calls are available to it: read(), write(), exit(), and sigreturn(). An attempt to make any other system call results in immediate termination of the process. The idea is that a control process could obtain the code to be run and load it into memory. After setting up its file descriptors appropriately, this process would call:
prctl(PR_SET_SECCOMP, 1);
to enable seccomp mode. It would then jump into the guest code, knowing that no real harm could be done. The guest code can run on the CPU and communicate over the file descriptors given to it, but it has no other access to the system. Later, it was the Google team that decided seccomp would make a good platform on which to create a “finished implementation” for Linux, which they used for Google Chrome.
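To get a feel for how strict this original mode is, here is a minimal Python sketch that repeats the prctl() call above from a forked child process. It is only an illustration under some assumptions: a Linux host, the numeric constants copied from the kernel headers, and a function name (run_child_in_strict_mode) that I made up for this example.

```python
import ctypes
import os

# Constants from <linux/prctl.h> and <linux/seccomp.h>.
PR_SET_SECCOMP = 22
SECCOMP_MODE_STRICT = 1


def run_child_in_strict_mode():
    """Fork a child, switch it to seccomp strict mode, report how it ended."""
    pid = os.fork()
    if pid == 0:
        # Child process: it must never return into the caller's code.
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            args = [ctypes.c_ulong(v) for v in (SECCOMP_MODE_STRICT, 0, 0, 0)]
            if libc.prctl(PR_SET_SECCOMP, *args) != 0:
                os._exit(2)  # the kernel refused to enter strict mode
            # From here on only read, write, exit and sigreturn are allowed.
            # Any other system call, including the getpid and exit_group
            # calls below, gets the child terminated.
            os.getpid()
        finally:
            # Reached only if something went wrong before strict mode took
            # effect, for example when prctl is missing on this platform.
            os._exit(3)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "exited with status %d" % os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_child_in_strict_mode())
```

On a Linux host where strict mode is available, I would expect the child to be reported as killed by signal 9 (SIGKILL). A process that already runs under a seccomp filter, for example inside a Docker container with the default profile, may be refused strict mode, in which case the sketch reports exit status 2 instead.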
Secure computing mode (seccomp) is a Linux kernel feature that lets you control which system calls a process may make, and this applies to containers as well when applied via Docker. You can use it to restrict the actions available within a container and so limit your application’s access. The seccomp() system call operates on the seccomp state of the calling process. In addition, Docker ships a default profile, which is now part of the Docker daemon.
By default, the new Docker seccomp profile disables 44 system calls out of the 313 available on 64-bit Linux systems. These system calls are listed on the official Docker site (shown below for reference).
Syscall | Description |
---|---|
acct | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT. |
add_key | Prevent containers from using the kernel keyring, which is not namespaced. |
adjtimex | Similar to clock_settime and settimeofday, time/date is not namespaced. |
bpf | Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN. |
clock_adjtime | Time/date is not namespaced. |
clock_settime | Time/date is not namespaced. |
clone | Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS. |
create_module | Deny manipulation and functions on kernel modules. |
delete_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
finit_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
get_kernel_syms | Deny retrieval of exported kernel and module symbols. |
get_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
init_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
ioperm | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
iopl | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
kcmp | Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE. |
kexec_file_load | Sister syscall of kexec_load that does the same thing, slightly different arguments. |
kexec_load | Deny loading a new kernel for later execution. |
keyctl | Prevent containers from using the kernel keyring, which is not namespaced. |
lookup_dcookie | Tracing/profiling syscall, which could leak a lot of information on the host. |
mbind | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
modify_ldt | Old syscall only used in 16-bit code and a potential information leak. |
mount | Deny mounting, already gated by CAP_SYS_ADMIN. |
move_pages | Syscall that modifies kernel memory and NUMA settings. |
name_to_handle_at | Sister syscall to open_by_handle_at. Already gated by CAP_SYS_NICE. |
nfsservctl | Deny interaction with the kernel nfs daemon. |
open_by_handle_at | Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH. |
perf_event_open | Tracing/profiling syscall, which could leak a lot of information on the host. |
personality | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
pivot_root | Deny pivot_root, should be privileged operation. |
process_vm_readv | Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE. |
process_vm_writev | Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE. |
ptrace | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_PTRACE. |
query_module | Deny manipulation and functions on kernel modules. |
quotactl | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN. |
reboot | Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT. |
restart_syscall | Don’t allow containers to restart a syscall. Possible seccomp bypass see: https://code.google.com/p/chromium/issues/detail?id=408827. |
request_key | Prevent containers from using the kernel keyring, which is not namespaced. |
set_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
setns | Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN. |
settimeofday | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
stime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
swapon | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
swapoff | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
sysfs | Obsolete syscall. |
_sysctl | Obsolete, replaced by /proc/sys. |
umount | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
umount2 | Should be a privileged operation. |
unshare | Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user. |
uselib | Older syscall related to shared libraries, unused for a long time. |
ustat | Obsolete syscall. |
vm86 | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
vm86old | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
Source ~ https://docs.docker.com/engine/security/seccomp/
It’s time to look at how to get the seccomp feature working. By default, Docker 1.10 as shipped in the DEB or RPM packages for older distributions doesn’t support the seccomp feature.
Please Note: Seccomp profiles require seccomp 2.2.1 and are only available starting with Debian 9 “Stretch”, Ubuntu 15.10 “Wily”, and Fedora 22. To use this feature on Ubuntu 14.04, Debian Wheezy, or Debian Jessie, you must download the latest static Docker Linux binary.
I am using Ubuntu 14.04.3 for my Docker containers, so I will download the static Docker Linux binary from https://docs.docker.com/engine/installation/binaries/ and run it to get seccomp support working.
root@dell:~# wget https://get.docker.com/builds/Linux/x86_64/docker-1.10.2
--2016-02-28 05:33:15-- https://get.docker.com/builds/Linux/x86_64/docker-1.10.2
Connecting to 10.116.2.242:80... connected.
Proxy request sent, awaiting response... 200 OK
Length: 34892323 (33M) [binary/octet-stream]
Saving to: ‘docker-1.10.2.1’
100%[==============================================>] 3,48,92,323 972KB/s in 31s
2016-02-28 05:33:46 (1.06 MB/s) - ‘docker-1.10.2.1’ saved [34892323/34892323]
root@dell:~# chmod +x docker-1.10.2.1
root@dell:~# ./docker-1.10.2.1 daemon
INFO[0000] [graphdriver] using prior storage driver "aufs"
INFO[0000] Graph migration to content-addressability took 0.00 seconds
INFO[0000] Firewalld running: false
INFO[0000] Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address
..
INFO[0001] Loading containers: done.
INFO[0001] Daemon has completed initialization
INFO[0001] Docker daemon commit=c3959b1 execdriver=native-0.2 graphdriver=aufs version=1.10.2
INFO[0001] API listen on /var/run/docker.sock
Ensure that the Docker service is running properly:
root@dell:~# docker version
Client:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 21:37:01 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.2
API version: 1.22
Go version: go1.5.3
Git commit: c3959b1
Built: Mon Feb 22 22:37:33 2016
OS/Arch: linux/amd64
root@dell:~#
Passing unconfined to run a container without the default seccomp profile
Docker’s default seccomp profile is a whitelist which specifies the calls that are allowed.
Let’s take a simple example:
root@dell:~# docker run --rm -it --security-opt seccomp:unconfined hello-world
Hello from Docker.
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker Hub account:
https://hub.docker.com
For more examples and ideas, visit:
https://docs.docker.com/userguide/
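The hello-world image only prints its message, so it does not show whether a filter is active. One way to check, sketched below, is to read the Seccomp field that the kernel exposes in /proc/<pid>/status on reasonably recent kernels (0 means disabled, 1 strict mode, 2 filter mode). The function name seccomp_mode is my own, and running this inside a container assumes an image that ships Python, which hello-world does not.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp:" is followed by a tab and the numeric mode.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown (%d)" % value)
    return "not reported"  # older kernels do not expose the field


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            print("seccomp mode:", seccomp_mode(handle.read()))
    except OSError:
        print("no /proc/self/status on this system")
```

With seccomp:unconfined I would expect this to report "disabled", and "filter" when a profile is applied, but treat that as something to verify on your own host rather than a guarantee.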
Passing a profile to run a container with a custom seccomp profile
Let’s pick two syscalls, “mount” and “chown”, for this example.
Create a JSON file with the contents below and save it as mypolicy.json.
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "name": "mount",
      "action": "SCMP_ACT_ERRNO"
    },
    {
      "name": "chown",
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
Make sure the file is valid JSON and follows this format exactly; JSON is strict about quotes, commas, and brackets, so a small slip in any entry will make the profile unusable.
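Since hand-written JSON is easy to get wrong, one option is to generate the file and sanity-check it before handing it to Docker. The sketch below does that in Python for the layout shown above. The helper names build_profile and check_profile_text are hypothetical, and the checks cover only the two fields used in this article, not everything Docker itself validates.

```python
import json


def build_profile(blocked_syscalls, default_action="SCMP_ACT_ALLOW"):
    """Build a profile dict in the name/action layout used in this article."""
    return {
        "defaultAction": default_action,
        "syscalls": [
            {"name": name, "action": "SCMP_ACT_ERRNO"}
            for name in blocked_syscalls
        ],
    }


def check_profile_text(text):
    """Parse profile text and return a list of problems (empty if none found)."""
    try:
        profile = json.loads(text)
    except ValueError as error:
        return ["not valid JSON: %s" % error]
    if not isinstance(profile, dict):
        return ["top level must be a JSON object"]
    problems = []
    if "defaultAction" not in profile:
        problems.append("missing defaultAction")
    for index, entry in enumerate(profile.get("syscalls", [])):
        if not isinstance(entry, dict) or "name" not in entry or "action" not in entry:
            problems.append("syscalls[%d] needs both name and action" % index)
    return problems


if __name__ == "__main__":
    text = json.dumps(build_profile(["mount", "chown"]), indent=2)
    print(text)
    print("problems:", check_profile_text(text))
```

Redirecting the printed JSON into mypolicy.json gives a file equivalent to the one above. Typographic “curly” quotes, which often sneak in when copying from a web page, are reported as invalid JSON by the check.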
Run the command below to apply the seccomp profile to a container:
root@dell-PowerEdge-R630:~# docker run -it --security-opt seccomp:mypolicy.json busybox sh
/ # ls
bin dev etc home proc root sys tmp usr var
/ # mkdir test
/ # chown root:root test
chown: test: Operation not permitted
/ #
The first command passes the seccomp profile to the container being started through the --security-opt parameter. Once inside the container, a user who tries to change the ownership of a file is not permitted to do so. Similarly, the mount command is blocked.
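The difference between a whitelist such as Docker’s default profile and a blacklist such as mypolicy.json comes down to the defaultAction. The toy model below shows how the two styles decide the fate of a syscall. It is only a sketch of the idea, not how the kernel or Docker evaluates profiles: it matches on the syscall name alone, and the tiny whitelist in it is hypothetical and far shorter than the real default profile.

```python
def decide(profile, syscall):
    """Return the action a name/action style profile assigns to a syscall."""
    for entry in profile.get("syscalls", []):
        if entry["name"] == syscall:
            return entry["action"]
    # No explicit rule: fall back to the profile-wide default.
    return profile["defaultAction"]


# Blacklist: allow everything except the listed calls (same as mypolicy.json).
BLACKLIST = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {"name": "mount", "action": "SCMP_ACT_ERRNO"},
        {"name": "chown", "action": "SCMP_ACT_ERRNO"},
    ],
}

# Whitelist: deny everything except the listed calls (hypothetical, tiny).
WHITELIST = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"name": "read", "action": "SCMP_ACT_ALLOW"},
        {"name": "write", "action": "SCMP_ACT_ALLOW"},
    ],
}

if __name__ == "__main__":
    for name in ("chown", "mkdir", "read"):
        print(name, decide(BLACKLIST, name), decide(WHITELIST, name))
```

Under the blacklist, mkdir is allowed because nothing mentions it, which matches the session above where mkdir worked and chown failed. Under the whitelist, the same unlisted call is refused.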
In summary, we experienced that t