demystifying-containers's Introduction

Demystifying Containers

This series of blog posts and corresponding talks aims to give you a pragmatic view of containers from a historical perspective. Together we will discover modern cloud architectures layer by layer: we start at the Linux kernel level and end up writing our own secure cloud native applications.

Simple examples paired with the historical background will guide you from a minimal Linux environment up to crafting secure containers that fit into today's and tomorrow's orchestration world. In the end it should be much easier to understand how features within the Linux kernel, container tools, runtimes, software-defined networks and orchestration software like Kubernetes are designed and how they work under the hood.

Part I: Kernel Space
Part II: Container Runtimes
Part III: Container Images
Part IV: Container Security

Part I: Kernel Space

This first blog post (and talk) is scoped to Linux kernel related topics, which will provide you with the necessary foundation to build a deep understanding of containers. We will gain insight into the history of UNIX and Linux and talk about solutions like chroot, namespaces and cgroups, combined with hacking on our own examples. Besides this, we will peel apart some containers to get a feeling for the topics covered in later parts.

You can find the blog post:

The corresponding talk:

The slides of the talk:

on Slides.com

Part II: Container Runtimes

This second blog post (and talk) is primarily scoped to container runtimes, where we will start with their historical origins before digging deeper into two dedicated projects: runc and CRI-O. We will first build a solid foundation of how container runtimes work under the hood by starting with the lower-level runtime runc. Afterwards, we will use the more advanced runtime CRI-O to run Kubernetes-native workloads, but without running Kubernetes at all.

You can find the blog post:

The corresponding talk:

The slides of the talk:

on Slides.com

Part III: Container Images

This third blog post (and talk) will be all about container images. As usual, we start with the historical background and the evolution of different container image formats. Afterwards, we will check out what is inside the latest Open Container Initiative (OCI) image specification by crafting, modifying and pulling apart our self-built container image examples. Besides that, we will learn some important best practices in modern container image creation by utilizing tools like buildah, podman and skopeo.

You can find the blog post:

The corresponding talk:

The slides of the talk:

on Slides.com

Part IV: Container Security

Security-related topics can be overwhelming, especially when we’re talking about the fast-paced container ecosystem. After multiple security vulnerabilities were encountered in 2019, the press is now questioning whether containers are secure enough for our applications and whether switching from Virtual Machines (VMs) to container-based workloads is really a good idea. Technologies like micro VMs aim to add an additional layer of security to sensitive applications.

But is security really a problem when running applications inside containers? It indeed is, if we do not fully understand the implications of the security-related possibilities we can apply, or if we don’t use them at all.

In this blog post, we will discover the bright world of container security in a pragmatic way. We will learn about relatively low-level security mechanisms like Linux capabilities or seccomp, but also about fully featured security enhancements like SELinux and AppArmor. We’ll have the chance to build up a common ground of understanding around container security. Besides that, we will take a look at securing container workloads at a higher level inside Kubernetes clusters by using Pod Security Policies and by securing the container images themselves. To achieve all of this, we will verify the results of our experiments by utilizing end-user applications like Kubernetes and Podman.

You can find the blog post:

Part X

Further parts of the series are not available yet.

Contributing

You want to contribute to this project? Wow, thanks! So please just fork it and send me a pull request.

demystifying-containers's Issues

part1: comments

Hiya, looks good overall. Just a few details that stood out.

chroot isn't used in containers anymore, for a variety of reasons related to the fact that chroot protections are pretty bad even in modern kernels. Instead we make use of pivot_root, which is a very similar concept but actually modifies what / means inside the entire mount namespace rather than in a single process's context. This results in the entire VFS layer simply not being able to access the mounts that existed in / before, which mostly prevents attacks like the fchdir one (file descriptors are still an issue, but you can't use them to get above the / of the namespace the file descriptor was opened within). The key bit of magic looks like:

% pivot_root new_root new_root/old_root
% # now we're in new_root, but /old_root contains all the mounts for the old one.
% mount --make-rprivate /old_root
% umount --lazy /old_root
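
For context, the steps around that key bit can be sketched as a small shell function. This is only an illustration under stated assumptions, not how any particular runtime does it: the helper names `run` and `enter_rootfs` and the `DRY_RUN` variable are made up for this example, the script is assumed to run as root inside a private mount namespace (for example a shell started with `unshare --mount`), and the final step is written as a lazy unmount (`umount --lazy`), which detaches the old root from the mount tree.

```shell
#!/bin/sh
# Sketch only: swap the root filesystem with pivot_root instead of chroot.
# Assumes root privileges inside a private mount namespace, for example a
# shell started with `unshare --mount`.

# run (hypothetical helper): execute a command, or only print it when the
# DRY_RUN variable is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# enter_rootfs (hypothetical helper): make the directory $1 the new /.
enter_rootfs() {
    new_root=$1
    # pivot_root needs new_root to be a mount point, so bind-mount it onto itself.
    run mount --bind "$new_root" "$new_root" || return 1
    run mkdir -p "$new_root/old_root" || return 1
    run pivot_root "$new_root" "$new_root/old_root" || return 1
    # Make sure the working directory is inside the new root.
    run cd / || return 1
    # Keep the following unmount from propagating to other mount namespaces.
    run mount --make-rprivate /old_root || return 1
    # Lazily unmount the old root so its mounts are no longer reachable.
    run umount --lazy /old_root
}
```

Setting `DRY_RUN=1` before calling `enter_rootfs /srv/rootfs` (a made-up path) prints the command sequence without touching any mounts, which is a safe way to read through the steps first.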

setns doesn't help keep namespaces around -- they are kept around as long as there is a reference to them (like basically all kernel structures). You can keep namespaces around by having a process that is in a namespace, or keeping around the /proc/$pid/ns magic link (this can be done by opening it in a process, or bind-mounting it) -- which you mention later.
This is slightly pedantic, but "symlinks" inside /proc aren't actually symlinks in the same sense as normal filesystems. Rather, they are "magic links" which allow you to gain access to the struct file they reference without going through the VFS -- which means that you don't need to be able to resolve the path (which is a restriction that symlinks have because they're applied as though each symlink was expanded within the original path string). You don't need to get into any of this of course -- maybe just put "symlinks" in quotes in that section and have a comment that they aren't real symlinks and move on.
I would phrase "all processes within the namespace are attached to the root PID" as "all processes within the namespace will be re-parented to the namespace's PID 1 rather than the host PID 1". Also I would fix up the following line to say "In addition" rather than "This also means", since it's a separate effect.
I would mention /proc/$pid/setgroups as part of the set of files for user namespaces, since it's required in order to set up rootless containers (and is also useful in other contexts).
Not a huge point, but you might want to use cgroup.procs rather than tasks but that's just a matter of taste (tasks supports thread-level granularity, while cgroup.procs will put the entire thread-group into the cgroup). You also don't mention cgroupv2 (which is understandable, but it is a fairly interesting topic if you haven't looked at it before).
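
The earlier remark about the /proc/$pid/ns entries being magic links rather than real symlinks can be checked without any special privileges. The following is a minimal sketch for a Linux system with /proc mounted; the helper names `ns_id` and `same_ns` are made up for this example.

```shell
#!/bin/sh
# Sketch only: inspect the "magic links" under /proc/<pid>/ns (Linux only).

# ns_id (hypothetical helper): print the namespace identifier of a process.
# $1 is a PID (or "self"), $2 is a namespace type such as mnt, pid or user.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns (hypothetical helper): succeed if the processes $1 and $2 are in
# the same namespace of type $3.
same_ns() {
    [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}

# The link target is an identifier of the form "mnt:[<inode number>]".
target=$(ns_id self mnt)
echo "link target: $target"

# That target is not a path that exists anywhere in the filesystem ...
if [ ! -e "$target" ]; then
    echo "target is not a resolvable path"
fi

# ... yet following the link still works, because the kernel hands out the
# referenced namespace object directly instead of resolving a path.
if [ -e /proc/self/ns/mnt ]; then
    echo "link can still be followed"
fi
```

Comparing identifiers, as `same_ns` does, is a simple way to tell whether two processes share a namespace. Keeping a namespace alive without a process in it additionally requires holding a reference to one of these entries, for example through an open file descriptor or a bind mount, and the bind mount needs root.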
