
ctalledo · on Jan 18, 2022

LXD is great, but one nice feature of Sysbox is that it's an OCI-based runtime, and therefore integrates with Docker, K8s, etc. In a way, Sysbox turns Docker containers or Kubernetes pods into LXD-like containers, although there are differences.

ctalledo · on Jan 17, 2022

Thanks for the feedback; I am one of the developers of Sysbox. Some answers to the above comments:

- Regarding the container isolation, Sysbox uses a combination of Linux user-namespace + partial procfs & sysfs emulation + intercepting some sensitive syscalls in the container (using seccomp-bpf). It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not. This is why we felt it was fair to put Sysbox at a similar isolation rating as gVisor, although if you view it from purely a syscall isolation perspective it's fair to say that gVisor offers better isolation. Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better). But in single-tenant environments, Sysbox does avoid the need for privileged containers in many scenarios because it allows well isolated containers/pods to run system workloads such as Docker and even K8s (which is why it's often used in CI infra).

- Regarding the speed rating, we gave Firecracker a higher speed rating than KubeVirt because while they both use hardware virtualization, the former runs microVMs that are highly optimized and have much less overhead than the full VMs that typically run on KubeVirt. While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed).

- Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).

- Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.

- Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.

- Regarding "The isolation offered by user namespaces is still very weak and not comparable to gVisor or Firecracker". The user namespace by itself mitigates several recent CVEs for containers, so it's a valuable feature. It may not offer VM-level isolation, but that's not what we are claiming. Furthermore, Sysbox uses the user-ns as a baseline, but adds syscall interception and procfs & sysfs emulation to further harden the isolation.

- "False marketing is a big red flag, especially for something as critical as a container runtime." That's not what we are doing.

- Rootless Docker/Podman are great, but they work at a different level than Sysbox. Sysbox is an enhanced "runc", and while Sysbox itself runs as true root on the host (i.e., Sysbox is not rootless), the containers or pods it creates are well isolated and avoid the need for privileged containers in many scenarios. This is why several companies use it in production too.

lima · on Jan 18, 2022

Thank you for taking the time to reply - happy to discuss this! :)

> It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not.

Have a look at what gVisor actually does: https://gvisor.dev/docs/architecture_guide/security

It fully implements a subset of the Linux kernel ABI in userspace, including procfs and sysfs and even memory and process management. No untrusted code ever interacts with the host kernel. Filesystem and network access goes through an IPC protocol and is handled by the gVisor processes on the host, which in turn run inside a user namespace and a seccomp sandbox for defense in depth.

This is a much, much stronger level of isolation than your approach or, arguably, even VMs (the trade-off is performance). "Sysbox isolates the container in ways that gVisor does not" just isn't true.

The sysbox approach is one kernel bug away from host system compromise, same as using regular containers. Emulating procfs and sysfs and using user namespaces takes away some of the attack surface and is great defense in depth, but does not provide isolation from the host kernel.

> Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better)

I've read numerous claims that sysbox is suitable for untrusted workloads, for instance in [1] and [2].

It's a nice product and certainly much, much better than running docker-in-docker using privileged containers, but given the significant remaining attack surface, this claim could put your customers at risk and should come with a big disclaimer.

> While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed)

Firecracker was designed for memory efficiency, faster cold start times and security (by virtue of being written in a memory-safe language). It means you can run more containers per host, but the actual workload performance overhead is identical to "normal" VMs and, in some cases, even slightly higher since Firecracker lacks some of the optimizations that have gone into QEMU.

> Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).

KubeVirt is just plain QEMU VMs using libvirt, which have been compared to gVisor quite extensively[3][4]. There's almost no overhead for memory/CPU and quite a lot of overhead for syscalls (but with big improvements recently with the introduction of VFS2 and soon LisaFS[5]). It's a classic trade-off - gVisor is more secure and efficient than QEMU, allowing a much larger number of instances to run on a host by virtue of better cooperation with the host kernel scheduler and memory management, but for raw performance, a QEMU VM always wins.

> Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.

Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else. See [6] for a practical example.

For containers, this used to be the case, but the situation improved in recent kernel releases.

For podman, almost every combination works - running systemd unprivileged, running podman inside podman, or even running rootless-podman-in-rootless-podman[7] and so does Kubernetes-in-rootless-{podman,docker}[8] (requiring very recent kernel features, though - notably cgroupsv2 and unprivileged overlayfs).

Running docker:dind-rootless inside unprivileged Docker containers also works, however, it requires "--security-opt seccomp=unconfined".

Sysbox definitely got to that point earlier and has better usability.

> - Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.

Apologies, then, for misinterpreting that.

[1]: https://blog.nestybox.com/2020/10/06/related-tech-comparison...

[2]: https://github.com/nestybox/sysbox/issues/120#issuecomment-9...

[3]: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/6e4619...

[4]: https://www.scitepress.org/Papers/2021/104405/104405.pdf

[5]: https://gvisor.dev/blog/2021/12/02/running-gvisor-in-product...

[6]: https://github.com/innobead/kubefire

[7]: https://www.redhat.com/sysadmin/podman-inside-container

[8]: https://kind.sigs.k8s.io/docs/user/rootless

ctalledo · on Jan 18, 2022

Thanks again for the detailed response.

> Have a look at what gVisor actually does

I am aware of what it does, though I had missed the fact that the Sentry and/or Gofer run within a user-ns (could not find this in the docs). I had also missed the fact that it does perform procfs/sysfs emulation (makes sense), so I stand corrected on that. In light of this, I'll modify the Sysbox GH table to show gVisor as having a stronger isolation rating (in fact, our Sysbox blog comparing technologies [1] did give gVisor a stronger isolation rating).

> the sysbox approach is one kernel bug away from host system compromise

All approaches are one bug away from host system compromise (gVisor, VMs, etc.), though I agree that approaches like gVisor and VMs have a reduced attack surface.

> I've read numerous claims that sysbox is suitable for untrusted workloads

It's not a black or white determination in my view. Users choose based on their environments & needs. We always make it clear to our users that VM-based approaches provide stronger isolation, per the Sysbox GH repo:

"Isolation wise, it's fair to say that Sysbox containers provide stronger isolation than regular Docker containers (by virtue of using the Linux user-namespace and light-weight OS shim), but weaker isolation than VMs (by sharing the Linux kernel among containers)."

> Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else

That's good to know (thanks), though the table in the Sysbox GH repo meant to compare Sysbox against Kata + Firecracker (since Kata is a container runtime). To the best of my knowledge running Docker, K8s, k3s, etc. inside a Kata container is not easy (see [1] and [2]).

> For containers, this used to be the case, but the situation improved in recent kernel releases.

It's correct that rootless docker/podman approaches are improving as far as what workloads they can run inside containers, although they still have several limitations [3], [4].

With Sysbox, most of these limitations don't apply because the solution works at the more basic "runc" level, Sysbox itself is rootful, and it uses some of the techniques I mentioned before (user-ns, procfs & sysfs virtualization, syscall trapping, UID-shifting, etc.) to make the container resemble a "real host" while providing good isolation.
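To make the user-ns part of that list concrete, here is a minimal Python sketch (not Sysbox code) that parses the kernel's `/proc/<pid>/uid_map` format, where each line holds three numbers: the first ID inside the namespace, the first ID it maps to outside, and the range length. The names `IdMapping`, `parse_id_map` and `root_is_remapped` are made up for this illustration.

```python
from typing import List, NamedTuple


class IdMapping(NamedTuple):
    inside_start: int   # first ID as seen inside the namespace
    outside_start: int  # first ID it maps to outside the namespace
    length: int         # number of consecutive IDs covered


def parse_id_map(text: str) -> List[IdMapping]:
    """Parse the contents of /proc/<pid>/uid_map (gid_map has the same format)."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        mappings.append(IdMapping(*(int(f) for f in fields)))
    return mappings


def root_is_remapped(mappings: List[IdMapping]) -> bool:
    """True if UID 0 inside the namespace maps to a non-zero UID outside it."""
    for m in mappings:
        if m.inside_start <= 0 < m.inside_start + m.length:
            return m.outside_start - m.inside_start != 0
    return False  # UID 0 is not mapped at all
```

Feeding it the contents of `/proc/self/uid_map` from inside a container should give a rough indication of whether root in the container is an unprivileged user outside of it; what the file shows also depends on the namespace of the process reading it, so treat the result as a hint rather than proof.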

Good discussion, please let me know of any more feedback.

[1] https://github.com/kata-containers/kata-containers/issues/20...

[2] https://github.com/daniel-noland/docker-in-kata

[3] https://docs.docker.com/engine/security/rootless/#known-limi...

[4] https://github.com/containers/podman/blob/main/rootless.md

ctalledo · on Nov 5, 2021

Hi HN, this is Cesar, one of the developers behind Sysbox, a next-generation "runc".

Sysbox enables containers (or pods) to act as "VM-like" environments, capable of running systemd, Docker, Kubernetes and more, seamlessly & securely.

It solves the problem of needing insecure privileged containers and complex container configs to run these workloads in containers.

It's a "runc", so it works under Docker and Kubernetes (and you can easily install it on GKE, EKS, AKS, Rancher, local cluster, etc.)

Very useful when using Docker-in-Docker or K8s-in-Docker (kind) for CI, when using containers as dev environments, or when running workloads that normally don't run in containers.

Hope you find it useful, would love to hear feedback!

ctalledo · on March 29, 2021

Thanks! Yes, if you wish to set up dev environments backed by Docker or K8s containers/pods, Sysbox is an excellent way to do so because it gives you a rootless container inside of which you can run most workloads that run in VMs.

Prior to Sysbox this required privileged containers, which offer very weak isolation from the host (not to mention it also required complex container setups/entrypoints, all of which go away with Sysbox).

ctalledo · on Nov 27, 2020

If you are using Docker-in-Docker, you may want to check out the new Sysbox runtime (find it on GitHub). It's a new type of runc that sits below Docker and creates rootless containers capable of running Docker, systemd, K8s, etc. All you have to do is "docker run --runtime=sysbox-runc <some-image-with-docker>" and you'll get a docker daemon that is fully isolated from the host. It's a great way of avoiding privileged containers or mounts to the host docker socket.
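For anyone scripting this (say, from a CI job), a minimal Python sketch of building that command line follows. Only `--runtime=sysbox-runc` comes from the comment above; `-d` and `--name` are standard `docker run` flags. The function name and the image/container names in the usage note are placeholders I made up.

```python
import shlex
from typing import List, Optional


def sysbox_run_command(image: str, name: Optional[str] = None,
                       detach: bool = True) -> List[str]:
    """Build a `docker run` argv that selects the Sysbox runtime.

    Note what is absent: no --privileged flag and no bind mount of the
    host's /var/run/docker.sock.
    """
    argv = ["docker", "run", "--runtime=sysbox-runc"]
    if detach:
        argv.append("-d")
    if name:
        argv += ["--name", name]
    argv.append(image)
    return argv


def as_shell(argv: List[str]) -> str:
    """Render the argv as a copy-pasteable shell line."""
    return shlex.join(argv)
```

For example, `as_shell(sysbox_run_command("some-image-with-docker", name="inner"))` renders the line to run, and the list itself can be passed to `subprocess.run` on a host where Docker and Sysbox are installed.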

ctalledo · on Aug 21, 2020

+1 for the Emacs client.

ctalledo · on Aug 11, 2020

Thanks; one thing I may have omitted mentioning is that Sysbox works with the fast overlayfs storage driver, meaning that when you do use it for Docker-in-Docker for example, both the outer Docker and the inner Docker are using overlayfs (as opposed to the slower vfs driver).
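If you want to check this on your own setup, here is a rough Python sketch that asks a Docker daemon for its storage driver via `docker info --format '{{.Driver}}'`. The helper names are made up, and treating any driver name that starts with "overlay" as overlayfs-based is my own simplification.

```python
import subprocess
from typing import Sequence


def storage_driver(docker_argv: Sequence[str] = ("docker",)) -> str:
    """Return the storage driver reported by `docker info`.

    docker_argv is the command prefix used to reach the Docker CLI, so the
    same helper can query the outer daemon or one inside a container.
    """
    result = subprocess.run(
        [*docker_argv, "info", "--format", "{{.Driver}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def uses_overlayfs(driver: str) -> bool:
    """True for overlay-style drivers (e.g. overlay2), False for e.g. vfs."""
    return driver.strip().lower().startswith("overlay")
```

`storage_driver()` queries the outer Docker; `storage_driver(("docker", "exec", "inner", "docker"))` would query the inner one, assuming a running container named `inner` (a placeholder) that has the Docker CLI and daemon inside it.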

ctalledo · on Aug 10, 2020

A use case that we often get asked about for Docker-in-Docker is using the outer container as a dev environment that includes a developer's tools, ssh, and a dedicated Docker (CLI + daemon). It gives sys-admins a lighter-weight alternative to VMs for launching those dev environments, and works well in scenarios where efficiency & cost reduction are important and having VM-level isolation is not required. The problem is that prior to Sysbox, those outer containers had to be privileged containers, which provide very weak isolation (e.g., it's possible to turn off the host from within the privileged container!). With Sysbox, those outer containers are now properly isolated via the Linux user-namespace, truly enabling this use-case.

ctalledo · on Aug 9, 2020

Ubuntu carries a few things that Sysbox relies on: a couple that come to mind are the shiftfs module (which Sysbox uses to enable the user-namespace in containers without requiring Docker to be set in userns-remap mode) and a kernel patch that allows overlayfs mounts from within a user-namespace (since the Docker running inside the container uses overlayfs mounts for its inner images). Having said this, we are looking at ways of overcoming these requirements to extend Sysbox to more distros; it's one of the most asked features.
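As a rough way to see whether a given kernel currently exposes shiftfs, the sketch below parses `/proc/filesystems`, which lists one registered filesystem type per line, optionally prefixed with `nodev`. This is only an illustration, not how Sysbox itself does the detection, and the function names are made up. A filesystem built as a module only shows up there once the module is loaded, so a negative result is not conclusive.

```python
from typing import Set


def registered_filesystems(text: str) -> Set[str]:
    """Parse /proc/filesystems content into a set of filesystem type names."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[-1])  # the name is the last column, after any "nodev"
    return names


def has_filesystem(text: str, fs_name: str) -> bool:
    """True if fs_name is among the registered filesystem types."""
    return fs_name in registered_filesystems(text)
```

Usage would be along the lines of `has_filesystem(open("/proc/filesystems").read(), "shiftfs")` on the host.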

ctalledo · on Aug 8, 2020

There is plenty of info on Kubernetes (K8s) on the web, so I would start there. As far as running K8s inside Docker containers though, the use case would be one in which you want to run multiple isolated K8s clusters on a single host. One way is to use VMs, but recently people are resorting to using containers for this purpose due to their ease & efficiency. It's in the latter that Sysbox really helps, because it's capable of creating a container that runs K8s easily and with proper isolation. Typical use cases are testing, CI/CD, learning. But I would not discount this moving into production use cases in the future as the technology matures.