A container manager is responsible for a subset of the functionality required to run a Linux container. Typically, this functionality includes:
- pulling, caching, and unpacking container images
- preparing container bundles (a root filesystem plus a runtime spec)
- maintaining container state, both in memory and persisted to disk
- communicating with a low-level container runtime (such as runc) or a runtime shim
- servicing client requests to create, start, stop, inspect, and delete containers
Container managers themselves need to rely on other pieces of software (both above and below them in the stack) to provide their desired functionality.
For the initial implementation, we will cover the core container lifecycle operations: creating, starting, stopping, inspecting (get/list), and deleting containers, along with the ability to survive daemon restarts.
As a user, I interact with my container manager by telling it to do something like “create a busybox container” or “retrieve the status of my nginx container”. We will walk through concrete examples of these interactions below.
This means our container manager needs both server and client components. More specifically, the container manager itself is a daemon (the server component) that exposes its API to users like me via a CLI (the client component). The container manager needs to prepare container bundles, maintain persistent container state to survive restarts, communicate with the low-level container runtime (runc or a runtime shim), and service client requests. Note that most container managers also provide image pulling/caching/unpacking, but we will not be implementing this piece. Based on these requirements, we can roughly break our container manager down into the following components:
- ContainerManager: the top-level component that services client requests and orchestrates the others; this is what we call ContainerManager in our source code.
- ContainerStore: manages the on-disk state, including container directories, bundles, and persisted container state files.
- ContainerRuntime: the interface to the low-level container runtime (runc), used to generate runtime specs and to create, start, kill, and delete containers.
- ContainerMap: the in-memory store of container state, keyed by container ID.
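To make this composition concrete, here is a minimal sketch of how these components might hang together. The field names mirror the usages in the code excerpts below (container_store, container_runtime, container_map); the actual definitions in the project may differ.
// Sketch only: the container manager owns one instance of each supporting
// component; the component types are assumed to be defined elsewhere.
pub struct ContainerManager {
    // on-disk state: container directories, bundles, persisted state files
    container_store: ContainerStore,
    // interface to the low-level container runtime (runc)
    container_runtime: ContainerRuntime,
    // in-memory store of container state
    container_map: ContainerMap,
}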
We’ve established that a container manager needs to prepare container bundles, maintain persistent and in-memory state, restart gracefully, and leverage the low-level container runtime. To zoom in on the details and understand how these overlapping responsibilities are fulfilled, let’s explore the logic behind each API call:
- container create
- container start
- container stop
- container get
- container list
- container delete
Here is an example container create command from the client’s perspective:
vagrant@vagrant:~$ bin/client container create my_container --rootfs=~/tmp/rootfs/ echo Hello World
Breaking this command down:
- bin/client container create → invoke the client executable and specify that we’re performing a container creation
- my_container → the container name
- --rootfs=~/tmp/rootfs → specify the container’s root filesystem
- echo → the command to execute in our container
- Hello World → the args provided to our command
and the response will be:
created: 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
We’re given back the container ID 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2. We can use this ID in future interactions with the container manager. As input, the container manager accepts a container name, root filesystem, command, and arguments. It processes the request, and as output it produces a container ID. Under the hood, here’s what the container manager needed to do:
1. Generate a container ID, in this case 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2.
2. Create and store the in-memory container structure.
3. Create the container directory on disk at lib_root/containers/6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2.
4. Create the container bundle at lib_root/containers/6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2/bundle. Creating the container bundle requires us to:
   - copy the provided root filesystem into lib_root/containers/6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2/bundle/rootfs
   - generate the runtime spec at lib_root/containers/6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2/bundle/config.json, and update the spec file with the provided command echo and arguments Hello World
5. Create the container via the low-level container runtime (runc create).
6. Update the container’s creation time and status to Created, and persist this state to disk.
Here is the business logic our container manager runs to handle a create request:
/// create_container does the following:
/// - invoke create_container_helper to create the container
/// - on an error, invoke rollback_container_create to clean up leftover
/// state, including in-memory container and container directory on disk
pub fn create_container(
&self,
opts: ContainerOptions,
) -> Result<String, ContainerManagerError> {
self.create_container_helper(opts).or_else(|err| {
// best effort rollback
self.rollback_container_create(&err.container_id);
return Err(err.source);
})
}
/// create_container_helper does the following:
/// - generate container id
/// - create and store the in-memory container structure
/// - create the container directory on disk
/// - create the container bundle:
/// - copy the rootfs into the container bundle
/// - generate the runc spec for the container
/// - create the container (runc create)
/// - update container status, write those to disk
fn create_container_helper(
&self,
opts: ContainerOptions,
) -> Result<String, InternalCreateContainerError> {
// generate container id
let container_id = rand_id();
// create & store in-memory container structure
let container: Container =
new_container(&container_id, &opts.name, &opts.command, &opts.args);
let container_id =
self.container_map
.add(container)
.map_err(|err| InternalCreateContainerError {
container_id: container_id.clone(),
source: err.into(),
})?;
// create container directory on disk
self.container_store
.create_container_directory(&container_id)
.map_err(|err| InternalCreateContainerError {
container_id: container_id.clone(),
source: err.into(),
})?;
// create container bundle on disk
let container_bundle_dir = self
.container_store
.create_container_bundle(&container_id, &opts.rootfs_path)
.map_err(|err| InternalCreateContainerError {
container_id: container_id.clone(),
source: err.into(),
})?;
// create container runtime spec on disk
let spec_opts =
RuntimeSpecOptions::new(container_bundle_dir.clone(), opts.command, opts.args);
self.container_runtime
.new_runtime_spec(&spec_opts)
.map_err(|err| InternalCreateContainerError {
container_id: container_id.clone(),
source: err.into(),
})?;
// create container
let create_opts = RuntimeCreateOptions::new(
container_bundle_dir.clone(),
"container.pidfile".into(),
container_id.clone(),
);
self.container_runtime
.create_container(create_opts)
.map_err(|err| InternalCreateContainerError {
container_id: container_id.clone(),
source: err.into(),
})?;
// update container creation time, status, and persist to disk
self.update_container_created_at(&container_id, SystemTime::now())
.map_err(|source| InternalCreateContainerError {
container_id: container_id.clone(),
source,
})?;
self.update_container_status(&container_id, Status::Created)
.map_err(|source| InternalCreateContainerError {
container_id: container_id.clone(),
source,
})?;
self.atomic_persist_container_state(&container_id)
.map_err(|source| InternalCreateContainerError {
container_id: container_id.clone(),
source,
})?;
Ok(container_id)
}
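The bundle-preparation step above relies on ContainerRuntime::new_runtime_spec to produce the bundle’s config.json. As a rough sketch of what that might involve, assuming it shells out to runc spec and then patches the generated file with serde_json (the project’s actual implementation may differ):
use std::fs;
use std::path::Path;
use std::process::Command;

// Hypothetical sketch: generate a default spec with `runc spec`, then patch
// process.args with the user-provided command and arguments.
fn write_runtime_spec(
    runc_path: &str,
    bundle_dir: &Path,
    command: &str,
    args: &[String],
) -> Result<(), Box<dyn std::error::Error>> {
    // `runc spec --bundle <dir>` writes a default config.json into the bundle
    let status = Command::new(runc_path)
        .args(["spec", "--bundle"])
        .arg(bundle_dir)
        .status()?;
    if !status.success() {
        return Err("runc spec failed".into());
    }
    // patch process.args in the generated config.json
    let config_path = bundle_dir.join("config.json");
    let mut spec: serde_json::Value = serde_json::from_str(&fs::read_to_string(&config_path)?)?;
    let mut process_args = vec![serde_json::Value::from(command)];
    process_args.extend(args.iter().map(|a| serde_json::Value::from(a.as_str())));
    spec["process"]["args"] = serde_json::Value::from(process_args);
    fs::write(&config_path, serde_json::to_string_pretty(&spec)?)?;
    Ok(())
}
Either way, the end result is a bundle containing both a rootfs and a config.json describing the process to run.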
Our container is now in the Created state, which means it’s ready to be run!

Container start

Once our container is in a Created state, it is ready to be started. The start command looks as follows from the client’s perspective:
vagrant@vagrant:~$ bin/client container start 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
and the response will be:
started: 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
As input, the container manager needs a container ID for a container in the Created state, and after handling the request, it outputs that container ID back to us. Under the hood, the container manager needs to do the following things when it receives a start request:
1. Ensure the container exists and is in the Created state.
2. Instruct the low-level container runtime to start the container.
3. Update the container’s status (to Running) and persist this state to disk.
Here is the business logic our container manager runs to handle a start request:
/// start_container does the following:
/// - ensure container exists and is in created state
/// - start the container via the container runtime
/// - update container start time and status, then persist
pub fn start_container(&self, container_id: &ID) -> Result<(), ContainerManagerError> {
// ensure container exists and is in created state
match self.container_map.get(container_id) {
Ok(container) => {
if container.status != Status::Created {
return Err(
ContainerManagerError::StartContainerNotInCreatedStateError {
container_id: container_id.clone(),
},
);
}
}
Err(err) => return Err(err.into()),
}
// container start
self.container_runtime.start_container(container_id)?;
// update container start time and status in memory, then persist to disk
// this current approach just optimistically sets the container to
// running and allows future calls to get/list to synchronize with runc.
// one other way we could consider doing this is polling runc until we
// see that the container is running and then updating.
self.update_container_started_at(&container_id, SystemTime::now())?;
self.update_container_status(&container_id, Status::Running)?;
self.atomic_persist_container_state(&container_id)
}
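start_container above hands the actual work to the ContainerRuntime, and the stop and delete paths we’ll see below make analogous calls. As a sketch of what these runtime calls might look like if they simply shell out to the runc CLI (an assumption on my part; the project may drive runc differently), where runc_path is the value passed to the daemon via --runtime_path:
use std::process::Command;

// Hypothetical sketches of the runtime shell-outs.
fn runc_start(runc_path: &str, container_id: &str) -> std::io::Result<bool> {
    // `runc start <id>` runs the user-defined process in a created container
    Ok(Command::new(runc_path).args(["start", container_id]).status()?.success())
}

fn runc_kill(runc_path: &str, container_id: &str) -> std::io::Result<bool> {
    // `runc kill <id> KILL` sends SIGKILL to the container's init process
    Ok(Command::new(runc_path).args(["kill", container_id, "KILL"]).status()?.success())
}

fn runc_delete(runc_path: &str, container_id: &str) -> std::io::Result<bool> {
    // `runc delete <id>` releases the runtime's resources for a stopped container
    Ok(Command::new(runc_path).args(["delete", container_id]).status()?.success())
}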
Our container is now in the Running state. We can now get the container from our container manager to follow its status, or stop the container. Let’s look at the get command next, since we need to explore how we handle the state of containers that exit on their own.

Container get (and list)

The get command retrieves the status of a container, whether it’s in a Created, Running, or Stopped state. Directly after creating the container, the get request looks as follows from the client’s perspective:
vagrant@vagrant:~$ bin/client container get 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
and the response will be a table describing the container’s current state (ID, name, status, exit code, timestamps, command, and args). Remember that our container simply runs echo Hello World, which we should expect to exit almost immediately. This means that after starting the container, our next call to get should recognize that the container has stopped, updating the status in the process. For example:
vagrant@vagrant:~$ cruise_client container start 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2 && sleep 2
vagrant@vagrant:~$ cruise_client container get 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
will inform us that the status is now Stopped. Under the hood, this requires the container manager to synchronize itself with the low-level container runtime to determine the current status of the container before reporting it back to the user. Note that a container runtime shim often handles this synchronization on an event-driven basis for the container manager. The shim is also able to provide richer information, such as a precise finishing time and the exit code of the container process. Our container manager is going to be comparatively simple, and in the future will integrate with a container runtime shim to provide a fuller picture. To handle a get request, our container manager does the following:
1. Synchronize its view of the container with the low-level container runtime. Because the container process may have exited at any time since the last get request, the container manager does not know if the container has exited. To perform this reconciliation, the container manager retrieves the current state of the container from the low-level container runtime, and updates its in-memory and on-disk stores.
2. Return the container’s state from the in-memory store.
Here is the business logic our container manager runs to handle a get request:
/// get_container does the following:
/// - synchronize container state with the container runtime, which fails
/// if the container does not exist
/// - return container state from memory
pub fn get_container(
&self,
container_id: &ID,
) -> Result<Box<Container>, ContainerManagerError> {
self.sync_container_status_with_runtime(container_id)?;
self.container_map
.get(container_id)
.map_err(|err| err.into())
}
/// list_containers does the following:
/// - for every known container, synchronize container state with the
/// container runtime, which fails if any of the containers do not exist
/// - return container states from memory
pub fn list_containers(&self) -> Result<Vec<Container>, ContainerManagerError> {
match self.container_map.list() {
Ok(containers) => {
for container in containers.iter() {
self.sync_container_status_with_runtime(container.id())?;
}
}
Err(err) => return Err(err.into()),
};
self.container_map.list().map_err(|err| err.into())
}
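Both get_container and list_containers lean on sync_container_status_with_runtime. A rough sketch of what that synchronization might look like, assuming it queries runc state (which prints a JSON document containing a status field) and maps the result onto our own Status values (the project’s actual implementation may differ):
use std::process::Command;

// Hypothetical sketch: ask runc for the container's current state and return
// its "status" field ("created", "running", or "stopped").
fn query_runtime_status(
    runc_path: &str,
    container_id: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let output = Command::new(runc_path).args(["state", container_id]).output()?;
    if !output.status.success() {
        // the runtime does not know about this container
        return Err(format!("no runtime state for container {}", container_id).into());
    }
    let state: serde_json::Value = serde_json::from_slice(&output.stdout)?;
    Ok(state["status"].as_str().unwrap_or("unknown").to_string())
}
The manager can then update its in-memory record (and persist it) whenever the reported status differs from what it last knew, for example marking a Running container as Stopped once the runtime reports that it has exited.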
The list command is just a generalized form of a container get. Instead of synchronizing and returning a specific container ID, the container manager goes through its entire in-memory store, refreshing the state of all containers, and returning those results back to the client in their entirety.

Container stop

Once our container is in a Running state, if we don’t wish for it to continue running (and it has not yet exited), we can choose to stop the container. Note that stopping and deleting a container are distinct operations, and delete can only proceed if a container is in a Stopped state (or if it has not yet been started). The stop command looks as follows from the client’s perspective:
vagrant@vagrant:~$ bin/client container stop 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
and the response will be:
stopped: 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
If we were to get this container, it would now be reporting a Stopped status. Under the hood, our container manager does the following in response to a stop request:
1. Verify that the container exists and is in a Running state via its in-memory store.
2. If the container has a Running status, the container manager instructs the low-level container runtime to kill the container process (by sending it a SIGTERM or SIGKILL).
3. Update the container’s status to Stopped, storing the changes in-memory and persisting to disk as well.
Here is the business logic our container manager runs to handle a stop request:
/// stop_container does the following:
/// - ensure container exists and is in running state
/// - send a SIGKILL to the container via the container runtime
/// - update container status, then persist
pub fn stop_container(&self, container_id: &ID) -> Result<(), ContainerManagerError> {
// ensure container exists and is in running state
match self.container_map.get(container_id) {
Ok(container) => {
if container.status != Status::Running {
return Err(ContainerManagerError::StopContainerNotInRunningStateError {
container_id: container_id.clone(),
});
}
}
Err(err) => return Err(err.into()),
}
// send SIGKILL to container via the container runtime
self.container_runtime.kill_container(container_id)?;
// update container status and persist to disk
self.update_container_status(&container_id, Status::Stopped)?;
self.atomic_persist_container_state(&container_id)
}
Our container is now in the Stopped state. The container metadata still exists in our container manager and the low-level container runtime, and the container filesystem is still present on disk if we wished to inspect it. Our container is now eligible for deletion, which will free those resources.

Container delete

Once a container is in a Stopped state, it is eligible for deletion, which will free the resources it occupies even after the container process terminates. The delete command, unsurprisingly, looks like the following from the client’s perspective:
vagrant@vagrant:~$ bin/client container delete 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
and the response will be:
deleted: 6bce9dc1-0a03-4bb7-86f5-dd75fdae7fa2
At this point, the container and the resources it occupies no longer exist. Calls to get will fail, and calls to list will not find it. Under the hood, the container manager does the following in response to a delete call:
1. Verify that the container exists and is in a Stopped state (it is also valid to delete containers that are not yet started, meaning they are in a Created state).
2. If the container has a Stopped or Created status, the container manager tells the low-level container runtime to delete the container, which allows it to clean up its state related to this container.
3. Remove the remnants of the container from memory and from disk.
Here is the business logic our container manager runs to handle a delete request:
/// delete_container does the following:
/// - ensure container exists and is in stopped state
/// - tell the container runtime to delete the container
/// - remove remnants of container in memory and on disk
pub fn delete_container(&self, container_id: &ID) -> Result<(), ContainerManagerError> {
// ensure container exists and is in stopped state
match self.container_map.get(container_id) {
Ok(container) => {
if container.status != Status::Stopped && container.status != Status::Created {
return Err(
ContainerManagerError::DeleteContainerNotInDeleteableStateError {
container_id: container_id.clone(),
},
);
}
}
Err(err) => return Err(err.into()),
}
// instruct container runtime to delete container
self.container_runtime.delete_container(container_id)?;
// remove container from memory and disk
self.container_map.remove(&container_id);
self.container_store
.remove_container_directory(&container_id);
Ok(())
}
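delete_container relies on ContainerStore::remove_container_directory to reclaim everything under the container’s directory. For reference, here is a sketch of how the store might lay out and clean up these paths, matching the lib_root/containers/<id>/bundle structure we saw during create (the names are illustrative; the project’s actual code may differ):
use std::fs;
use std::io;
use std::path::PathBuf;

// Hypothetical sketch of the store's on-disk layout:
//   lib_root/containers/<container-id>/bundle/{rootfs, config.json}
struct ContainerStore {
    lib_root: PathBuf,
}

impl ContainerStore {
    fn container_dir(&self, container_id: &str) -> PathBuf {
        self.lib_root.join("containers").join(container_id)
    }

    fn bundle_dir(&self, container_id: &str) -> PathBuf {
        self.container_dir(container_id).join("bundle")
    }

    fn create_container_directory(&self, container_id: &str) -> io::Result<()> {
        fs::create_dir_all(self.container_dir(container_id))
    }

    fn remove_container_directory(&self, container_id: &str) {
        // best effort: removing the container directory also removes the bundle
        let _ = fs::remove_dir_all(self.container_dir(container_id));
    }
}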
Until now, we’ve examined how a container manager carries out its explicit responsibilities of manipulating containers in response to client requests. As we stated earlier, another requirement is that the container manager be able to survive restarts without impacting the containers it manages. This allows us to upgrade the container manager on the fly, and more importantly, it decouples the stability of the containers from the stability of the container manager. This means that if the container manager were to crash for some reason, our existing containers would continue running happily.

In discussing the business logic behind create, start, stop, get, list, and delete, we’ve already encountered some of the required groundwork that allows our container manager to survive restarts. Specifically, we persist container state to disk at every possible opportunity. This is important because that on-disk state is used to reconstruct the state of the world at container manager startup time. In fact, every time the container manager starts, it will attempt to reconstruct the state of the world, even the first time it starts on a machine (in which case the reconstruction is a no-op). It’s important to recognize that we have no guarantee that the state of the world at restart is the same as the state of the world when the container manager last stopped. This means that in addition to reloading the state from disk, the container manager needs to resync the state of every container it knew about prior to stopping. The goals of the restart are as follows:
- reload every container the manager knew about from its on-disk state into the in-memory store
- resynchronize each container’s status with the low-level container runtime
- discard any container whose state is corrupted or can no longer be reconciled
This is handled by the reload routine, which runs every time the daemon starts:
/// reload does the following:
/// - reads all container state files off disk
/// - if any of these state files fail to be parsed, we assume the
/// container is corrupted and remove it
/// - adds the container to the in-memory store
/// - syncs the container state with the container runtime
fn reload(&self) -> Result<(), ContainerManagerError> {
// get container ids off disk
let container_ids = self
.container_store
.list_container_ids()
.map_err(|source| ContainerManagerError::ReloadError { source })?;
for container_id in container_ids {
// parse container state file
let container = match self.container_store.read_container_state(&container_id) {
Ok(container) => container,
Err(err) => {
error!(
"unable to parse state of container `{}`, err: `{}`. Removing container.",
container_id, err
);
self.container_store
.remove_container_directory(&container_id);
continue;
}
};
// add container to in-memory store
match self.container_map.add(container) {
Ok(_) => (),
Err(err) => {
error!(
"unable to add container `{}` to in-memory state, err: `{:?}`. Continuing.",
container_id, err
);
continue;
}
}
// sync container with container runtime
match self.sync_container_status_with_runtime(&container_id) {
Ok(_) => (),
Err(err) => {
error!(
"unable to sync state of container `{}`, err: `{:?}`. Removing container.",
container_id, err
);
self.container_store
.remove_container_directory(&container_id);
self.container_map.remove(&container_id);
continue;
}
}
}
Ok(())
}
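The reload path above only works because every state change was persisted through atomic_persist_container_state. A common way to make such writes atomic, and roughly what I would assume happens here (the project’s details may differ), is to write to a temporary file and then rename it over the real state file, so a crash mid-write never leaves a half-written state file behind:
use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Hypothetical sketch: write serialized container state to a temp file,
// flush it to disk, then atomically rename it over the real state file.
fn atomic_write_state(state_path: &Path, serialized: &[u8]) -> io::Result<()> {
    let tmp_path = state_path.with_extension("tmp");
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(serialized)?;
        tmp.sync_all()?; // ensure the bytes hit disk before the rename
    }
    // fs::rename is atomic when source and destination are on the same filesystem
    fs::rename(&tmp_path, state_path)?;
    Ok(())
}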
I ran this in a Vagrant box with Ubuntu 18.04. Because our container manager shells out to runc under the hood, it needs to run on a Linux distro to work properly. Here’s what I put in my Vagrantfile:
Vagrant.configure("2") do |config|
config.vm.box = "hashicorp/bionic64"
end
# in the directory with your Vagrantfile, setup Vagrant box
$ vagrant up
# login
$ vagrant ssh
# install gcc
$ sudo apt-get update
$ sudo apt install -y gcc
# install rust, this takes a moment
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# add cargo to your PATH environment variable
$ source $HOME/.cargo/env
# install docker (used to create container root filesystem)
$ sudo curl -sSL https://get.docker.com/ | sh
# add the vagrant user to the docker group so we don't need to run docker commands as root
$ sudo usermod -aG docker vagrant
# logout and back in for the group change to take effect
$ logout
$ vagrant ssh
# clone the project
$ git clone https://github.com/willdeuschle/cruise
$ cd cruise
# build the project (daemon and client)
$ cargo build
# start the daemon, specifying its root directory and the path to runc
$ target/debug/daemon run --lib_root=./tmp/lib_root --runtime_path=/usr/bin/runc
# from a new shell, for the client
# in the directory with your Vagrantfile, login to your Vagrant box
$ vagrant ssh
# create rootfs for container
$ cd cruise && mkdir -p tmp/rootfs
$ docker export $(docker create busybox) | tar -C tmp/rootfs -xf -
# create container
$ target/debug/client container create my_container --rootfs=tmp/rootfs/ sh -- -c "echo hi; sleep 60; echo bye"
> created: 3a92e711-034e-410f-8aa9-700ae23c3a8d
# start container
$ target/debug/client container start 3a92e711-034e-410f-8aa9-700ae23c3a8d
> started: 3a92e711-034e-410f-8aa9-700ae23c3a8d
We should see the hi output over there from our new container!
# back in the daemon shell
$ target/debug/daemon run --lib_root=./tmp/lib_root --runtime_path=/usr/bin/runc
...
> hi
# from the client shell
# get container status
$ target/debug/client container get 3a92e711-034e-410f-8aa9-700ae23c3a8d
> ID NAME STATUS EXIT_CODE CREATED_AT STARTED_AT FINISHED_AT COMMAND ARGS
3a92e711-034e-410f-8aa9-700ae23c3a8d my_container Running -1 2020-08-30T23:46:45.788010499+00:00 2020-08-30T23:47:19.355861796+00:00 n/a sh -c, echo hi; sleep 60; echo bye
Our container is in the Running state (while it sleeps). After a minute, in our daemon shell, we will see bye output from our container:
# back in the daemon shell
$ target/debug/daemon run --lib_root=./tmp/lib_root --runtime_path=/usr/bin/runc
...
> hi
...
> bye
Calling get again will show that our container is now in a Stopped state:
# from the client shell
# get container status
$ target/debug/client container get 3a92e711-034e-410f-8aa9-700ae23c3a8d
> ID NAME STATUS EXIT_CODE CREATED_AT STARTED_AT FINISHED_AT COMMAND ARGS
3a92e711-034e-410f-8aa9-700ae23c3a8d my_container Stopped -1 2020-08-30T23:46:45.788010499+00:00 2020-08-30T23:47:19.355861796+00:00 n/a sh -c, echo hi; sleep 60; echo bye
# from the client shell
# delete container
$ target/debug/client container delete 3a92e711-034e-410f-8aa9-700ae23c3a8d
> deleted: 3a92e711-034e-410f-8aa9-700ae23c3a8d
# list containers
$ target/debug/client container list
> ID NAME STATUS EXIT_CODE CREATED_AT STARTED_AT FINISHED_AT COMMAND ARGS