How Do Containers Work

How Do Containers Work?

Introduction

If you’ve used Docker or Podman before, you’re likely familiar with the concept of containers. But what exactly is a container? What is it made of? How does it work? And how does it differ from virtual machines? Even if you think you know all the answers, you might still learn something new here.

What is a Container?

When you run a command like this in your terminal:

docker run --rm -it alpine sh

You’ll see output similar to the following:

Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
9824c27679d3: Pull complete
Digest: sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1
Status: Downloaded newer image for alpine:latest
/ #

Eventually, you’re given a shell where you can type commands, install packages, create users, and do anything else you want.

But what actually happened behind the scenes? The logs tell you that Docker didn’t find the alpine image locally, pulled it from a registry, and started a shell. This is a good high-level summary, but it doesn’t explain the underlying mechanics.

To understand how Docker and Podman work their magic, we first need to look at some key features of the Linux kernel. These features, called namespaces and cgroups, allow the Linux kernel to run each process in isolation from other processes on the system.

Linux Namespaces

Let’s start by exploring Linux namespaces. A Linux namespace is a kernel feature that controls what a process can see. You can think of namespaces as “boxes” for processes. Each process is contained within a set of these boxes, and if a process misbehaves (like trying to delete the entire filesystem), it won’t affect other processes that aren’t sharing the same namespaces. You can find a more technical definition by checking the man namespaces page.

At the time of writing, there are 8 namespaces:

  • IPC: System V IPC, POSIX message queues
  • Network: Network devices, stacks, ports, etc.
  • Mount: Mount points
  • PID: Process IDs
  • Time: Boot and monotonic clocks
  • User: User and group IDs
  • UTS: Hostname and NIS domain name

I also mentioned cgroups. While namespaces control what a process can see, cgroups control the resources a process can use (CPU, memory, disk I/O, etc.). I won’t go into too much detail about cgroups for now, as I believe understanding namespaces is the most crucial first step to grasping the core concept of a container.

Isolating a Process

We can isolate a process using the unshare system call, which runs a program in new namespaces. As specified in the man page:

The unshare command creates new namespaces (as specified by the command-line options described below) and then executes the specified program. If program is not given, then “${SHELL}” is run (default: /bin/sh).

Let’s start with the UTS namespace. Using this namespace, we can change the hostname for a process without affecting the host machine it’s running on.

$ sudo unshare --uts bash
root@laptop:/home/limerc#

Now, let’s change the hostname within this new shell.

root@laptop:/home/limerc# hostname container
root@laptop:/home/limerc# hostname
container

As you can see, the output is now container. We’ve successfully changed the hostname for this process. Now, let’s exit the process by typing exit and check the hostname of the host machine again.

root@laptop:/home/limerc# exit
exit
~
$ hostname
laptop

The hostname of the host machine did not change. This simple experiment demonstrates how namespaces work: we provided the process with its own hostname resource, which is independent of the host and other processes.

Next, let’s introduce the PID namespace. This one is important to understand. The PID namespace isolates the process ID number space. This means the same process ID number can exist in different namespaces. For example, you can’t have two processes with the same ID on your host machine, but you can have PID 1 on your host and another PID 1 inside a namespace. The PID within the namespace is relative; it’s PID 1 from the process’s perspective, but it has a different, unique ID on the host. A demonstration will make this clearer.

I’ll run the unshare command again with the --pid flag to create a new PID namespace. Then, in the new shell, I’ll run the ps command.

$ sudo unshare --pid sh
# ps
    PID TTY          TIME CMD
      1 ?        00:00:14 systemd
      2 ?        00:00:00 kthreadd
      ...

This output shows all the processes from the host machine. But why? When I first tinkered with Linux namespaces, I was confused. I thought running ps inside a PID namespace would only show processes running within that “container.”

The reason for this behavior is that the ps command reads process information from the virtual filesystem /proc, which starts at the root /. Even though our process is isolated by a namespace, the ps command is still reading from the host’s /proc. To fix this, we need to give the container its own root filesystem.

Changing the Root (chroot)

There’s a command called chroot:

chroot - run command or interactive shell with special root directory

This command does exactly what it says: it changes the filesystem perspective for a process. You can create an arbitrary directory on your host machine and use chroot to set it as the root for your process.

$ mkdir process-root
$ sudo chroot process-root
chroot: failed to run command ‘/bin/bash’: No such file or directory

The problem is that when we changed the root directory, there was no /bin directory, so chroot couldn’t find the /bin/bash executable. We could fix this by manually creating a ./process-root/bin directory and copying bash and its dependencies, but that’s a tedious process.

Instead, let’s download the Alpine Linux filesystem, which is a very small, minimal distribution.

$ curl -LO https://dl-cdn.alpinelinux.org/alpine/v3.22/releases/x86_64/alpine-minirootfs-3.22.1-x86_64.tar.gz

$ tar xzf alpine-minirootfs-3.22.1-x86_64.tar.gz -C process-root/

$ ls process-root/
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

As you can see, we downloaded the Alpine root filesystem and extracted it into the process-root directory, which will be the new root for our process.

Now, let’s run the chroot command again.

$ sudo chroot process-root sh
/ #

It worked! With chroot, we can give a process its own filesystem. Combined with namespaces, we can also give a process its own view of system resources, like its own PID, hostname, and network interfaces.

Now, let’s combine these two approaches.

$ sudo unshare --pid --fork chroot process-root sh
/ # mount -t proc proc proc
/ # ps
PID   USER      TIME  COMMAND
    1 root      0:00  sh
    3 root      0:00  ps
/ #

First, we run unshare with the PID namespace flag. You might also notice a new flag: --fork. This flag is used to create the shell process as a child of the unshare command. For the sake of this article, I won’t go into the details on that, but you can read about it in man unshare.

By combining unshare and chroot, and then mounting a new virtual /proc filesystem, we’ve successfully created a truly isolated environment. The ps command can now read from this new /proc filesystem, which is populated by the kernel with only the processes running inside our isolated “container.”

Conclusion

I hope this article was useful. Now you understand the basic mechanics behind containerization technologies like Docker. You can see that containers are essentially isolated processes that all share the same Linux kernel, which is the key difference from how virtual machines work. There’s much more to cover, such as the Mount, IPC, and Network namespaces. I hope to cover those topics in a future article. Thanks for reading!