ekzhang a day ago

Hi! This is a blog post sharing some low-level Linux networking we're doing at Modal with WireGuard.

As a serverless platform we hit a bit of a tricky tradeoff: we run multi-tenant user workloads on machines around the world, and each serverless function is an autoscaling container pool. How do you let users give their functions static IPs while keeping those IPs decoupled from the underlying compute, so the pool stays free to scale and move?

We needed a high-availability VPN proxy for containers and didn't find one, so we built our own on top of WireGuard and open-sourced it at https://github.com/modal-labs/vprox

Let us know if you have thoughts! I'm relatively new to low-level container networking, and we (me + my coworkers Luis and Jeffrey + others) have enjoyed working on this.

  • crishoj a day ago

    Neat. I am curious what notable differences there are between Modal and Tailscale.

    • ekzhang a day ago

      Thanks. We did check out Tailscale, but they didn't quite have what we were looking for: some high-availability custom component that plugs into a low-level container runtime. (Which makes sense, it's pretty different from their intended use case.)

      Modal is actually a happy customer of Tailscale (but for other purposes). :D

      • mrbluecoat a day ago

        So if a company only needs an outbound VPN for their road warriors and not an inbound VPN to access internal servers, vprox could be a simpler alternative to Tailscale?

  • xxpor a day ago

    You're using containers as a multi-tenancy boundary for arbitrary code?

    • ekzhang a day ago

      We use gVisor! It's an open-source application security sandbox spun off from Google. We work with the gVisor team to get the features we need (notably GPUs / CUDA support) and also help test gVisor upstream https://gvisor.dev/users/

      It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.

      • doctorpangloss a day ago

        Are these the facts?

        - You are using a container orchestrator like Kubernetes

        - You are using gVisor as a container runtime

        - Two applications from different users, containerized, are scheduled on the same node.

        Then, which of the following are true?

        (1) Both have shared access to an NVIDIA GPU

        (2) Both share access to the NVIDIA GPU via CUDA MPS

        (3) If there were 2 or more MIGs on the node with a MIG-supporting GPU, the NVIDIA container toolkit shim assigned a distinct MIG to each application

        • ekzhang a day ago

          We don't use Kubernetes to run user workloads, we do use gVisor. We don't use MIG (multi-instance GPU) or MPS. If you run a container on Modal using N GPUs, you get the entire N GPUs.

          If you'd like to learn more, you can check out our docs here: https://modal.com/docs/guide/gpu

          Re not using Kubernetes, we have our own custom container runtime in Rust with optimizations like lazy loading of content-addressed file systems. https://www.youtube.com/watch?v=SlkEW4C2kd4

          • ec109685 a day ago

            If the nvidia driver has a bug, can one workload access data of another running on the physical machine?

            E.g. it came up in this thread: https://news.ycombinator.com/item?id=41672168

            • JoshuaJB a day ago

              Yes. The kernel has access to data from every workload, and so technically a bug in _anything_ running at kernel level could result in data leakage.

          • doctorpangloss a day ago

            Suppose I ask for two H100s. Will I have GPU P2P capabilities?

            • ekzhang a day ago

              Yep! This is something we have internal tests for haha, you have good instincts that it can be tricky. Here's an example of using that for multi-GPU training https://modal.com/docs/examples/llm-finetuning

              • doctorpangloss a day ago

                Okay, well think very deeply about what you are saying about isolation; the topology of the hardware; and why NVIDIA does not allow P2P access even in vGPU settings except in specific circumstances that are not yours. I think if it were as easy to make the isolation promises you are making, NVIDIA would already do it. Malformed NVLink messages make GPUs fall off the bus even in trusted applications.

  • dangoodmanUT a day ago

    As a predominantly rust shop, why choose go for this?

    • klooney a day ago

      WireGuard's premier userspace implementation is in Go.

      • rendaw a day ago

        Premier = official? Do you know how it compares to this by Cloudflare? https://github.com/cloudflare/boringtun

        • ekzhang 19 hours ago

          Sorry, just realized the misunderstanding. To clarify, Modal still uses the kernel WireGuard module. The userspace part that's in Go, and not available in the other languages we use, is the wgctrl library.
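
          Rough sketch of what driving the kernel module through wgctrl can look like (illustrative only, not vprox's actual code; the device name, port, and keys are placeholders):

          ```go
          package main

          import (
              "log"
              "net"

              "golang.zx2c4.com/wireguard/wgctrl"
              "golang.zx2c4.com/wireguard/wgctrl/wgtypes"
          )

          func main() {
              // wgctrl configures the kernel WireGuard module over netlink; the "wg0"
              // device must already exist (e.g. `ip link add wg0 type wireguard`).
              client, err := wgctrl.New()
              if err != nil {
                  log.Fatalf("opening wgctrl client: %v", err)
              }
              defer client.Close()

              key, err := wgtypes.GeneratePrivateKey()
              if err != nil {
                  log.Fatalf("generating key: %v", err)
              }

              // In a real deployment the peer's public key is exchanged out of band;
              // generated here only so the example runs.
              peerPriv, _ := wgtypes.GeneratePrivateKey()
              port := 51820
              _, allowed, _ := net.ParseCIDR("10.0.0.2/32") // the client's VPN address

              cfg := wgtypes.Config{
                  PrivateKey:   &key,
                  ListenPort:   &port,
                  ReplacePeers: true,
                  Peers: []wgtypes.PeerConfig{{
                      PublicKey:  peerPriv.PublicKey(),
                      AllowedIPs: []net.IPNet{*allowed},
                  }},
              }
              if err := client.ConfigureDevice("wg0", cfg); err != nil {
                  log.Fatalf("configuring wg0: %v", err)
              }
          }
          ```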

      • ekzhang a day ago

        This is it. I like Rust a lot, but you gotta pick the right tool for the job sometimes

jimmyl02 a day ago

this is a really neat writeup! the design choice to have each "exit node" manage its local wireguard connections instead of relying on a global control plane is clever.

an unfinished project I worked on (https://github.com/redpwn/rvpn) was a bit more ambitious, with a global control plane, and I quickly learned that supporting multiple clients, especially for anything networking-related, is a tarpit. the focus here on linux / aws specifically, and the results achievable from it, are nice to see.

networking is challenging and this was a nice deep dive into some networking internals, thanks for sharing the details :)

  • ekzhang a day ago

    Thanks for sharing. I'm interested in seeing what a global control plane might look like; seems like authentication might be tricky to get right!

    Controlling our worker environment (like the `net.ipv4.conf.all.rp_filter` sysctl) is a big help for us, since it means we don't have to deal with the full range of possible network configurations.
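
    For illustration (not vprox's actual code), pinning a sysctl from the control process instead of trusting host defaults can be as simple as writing under /proc/sys; the value here is an assumption about the deployment's routing setup:

    ```go
    package main

    import (
        "log"
        "os"
        "path/filepath"
        "strings"
    )

    // setSysctl writes a value under /proc/sys, e.g. "net.ipv4.conf.all.rp_filter".
    func setSysctl(name, value string) error {
        path := filepath.Join("/proc/sys", strings.ReplaceAll(name, ".", "/"))
        return os.WriteFile(path, []byte(value), 0o644)
    }

    func main() {
        // Loose reverse-path filtering (2) so replies routed asymmetrically through
        // the WireGuard interface aren't silently dropped.
        if err := setSysctl("net.ipv4.conf.all.rp_filter", "2"); err != nil {
            log.Fatalf("setting rp_filter: %v", err)
        }
    }
    ```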

qianli_cs a day ago

Thanks for sharing. This new feature is neat! It might sound a bit out there, but here's a thought: could you enable assigning unique IP addresses to different serverless instances? For certain use cases, like web scraping, it's helpful to simulate requests coming from multiple locations instead of just one. I think allowing requests to originate from a pool of IP addresses would be doable given this proxy model.

heinternets a day ago

So much work seems to go into working around the limitations of IPv4 instead of towards a fully IPv6 capable world.

  • klysm a day ago

    Unfortunately we gotta do both. Overlay networks like wireguard might be a good stepping stone to move software towards IPv6 anyway

cactacea a day ago

Static IPs for allowlists need to die already. It's 2024, come on, surely we can do better than this

  • ekzhang a day ago

    What would you suggest as an alternative?

    • thatfunkymunki a day ago

      a more modern, zero-trust solution like mTLS authentication

      • ekzhang a day ago

        That makes sense, mTLS is great. Some services like Google Cloud SQL are really good about support for it. https://cloud.google.com/sql/docs/mysql/configure-ssl-instan...

        It's not quite a zero-trust solution though due to the CA chain of trust.

        mTLS is security at a different layer than IP source whitelisting, though. I'd say a lot of the companies we spoke to would want both as a defense-in-depth measure. Even with mTLS, network whitelisting stays relevant: if your certificate were exposed, for instance, an attacker would still need to be able to forge a source IP address to start a connection.
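
        For a sense of what mTLS looks like from the client side, here's a minimal Go sketch using only the standard library (the file paths and URL are placeholders, nothing Cloud SQL-specific):

        ```go
        package main

        import (
            "crypto/tls"
            "crypto/x509"
            "log"
            "net/http"
            "os"
        )

        func main() {
            // The certificate the client presents to the server.
            clientCert, err := tls.LoadX509KeyPair("client.pem", "client-key.pem")
            if err != nil {
                log.Fatalf("loading client certificate: %v", err)
            }

            // The private CA used to verify the server.
            caPEM, err := os.ReadFile("server-ca.pem")
            if err != nil {
                log.Fatalf("reading CA certificate: %v", err)
            }
            caPool := x509.NewCertPool()
            caPool.AppendCertsFromPEM(caPEM)

            client := &http.Client{
                Transport: &http.Transport{
                    TLSClientConfig: &tls.Config{
                        Certificates: []tls.Certificate{clientCert}, // what we present
                        RootCAs:      caPool,                        // who we trust
                    },
                },
            }

            resp, err := client.Get("https://internal.example.com/")
            if err != nil {
                log.Fatal(err)
            }
            defer resp.Body.Close()
            log.Println(resp.Status)
        }
        ```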

        • PLG88 a day ago

          If mTLS is combined with outbound connections, then IP source whitelisting is irrelevant; the external network cannot connect to your resources.

          This (and more) is exactly what we (I work on it) built with open source OpenZiti, a zero trust networking platform. Bonus points: it includes SDKs so you can embed ZTN into the serverless function; a colleague demonstrated it with a Python workload on AWS - https://blog.openziti.io/my-intern-assignment-call-a-dark-we....

        • thatfunkymunki a day ago

          I'd put it in the zero-trust category if the server (or owner of the server, etc) is the issuer of the client certificate and the client uses that certificate to authenticate itself, but I'll admit this is a pedantic point that adds nothing of substance. The idea being that you trust your issuance of the certificate and the various things that can be asserted based on how it was issued (stored in TPM, etc), rather than any parameter that could be controlled by the remote party.

    • sofixa a day ago

      JWT/OIDC, where the thing you're authenticating to (like MongoDB Atlas) trusts your identity provider (AWS, GCP, Modal, GitLab CI). It's better than mTLS because it allows for more flexibility in claims (extra metadata and security checks can be done with arbitrary data provided by the identity provider), and JWTs are usually shorter lived than certificates.

      • Thaxll a day ago

        How do you get a driver to use that, exactly?

        • sofixa a day ago

          A db connection driver? You pass the JWT as the username/password which contains the information about your identity and is signed by the identity provider that the party you're authenticating to has been configured to trust.

          Or you use a broker like Vault: you authenticate to it with that JWT, and it generates just-in-time, ephemeral username/password credentials for your database, which get rotated at some point.
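
          Roughly, the Vault flow could look like this (a sketch using the official Go client; the mount paths and role names are illustrative and depend entirely on how Vault is configured):

          ```go
          package main

          import (
              "log"
              "os"

              vault "github.com/hashicorp/vault/api"
          )

          func main() {
              client, err := vault.NewClient(vault.DefaultConfig())
              if err != nil {
                  log.Fatal(err)
              }

              // 1. Authenticate to Vault with the JWT issued by the identity provider.
              login, err := client.Logical().Write("auth/jwt/login", map[string]interface{}{
                  "role": "my-workload",
                  "jwt":  os.Getenv("WORKLOAD_JWT"),
              })
              if err != nil {
                  log.Fatalf("jwt login failed: %v", err)
              }
              client.SetToken(login.Auth.ClientToken)

              // 2. Ask the database secrets engine for short-lived credentials.
              creds, err := client.Logical().Read("database/creds/my-db-role")
              if err != nil {
                  log.Fatalf("issuing db credentials: %v", err)
              }
              log.Printf("ephemeral user %v, lease %ds", creds.Data["username"], creds.LeaseDuration)
          }
          ```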

  • klysm a day ago

    Completely agree. IP addresses are almost never a good means of authentication, and they result in brittle, inflexible architecture as well: applications become aware of layers they should be abstracted from.

    • bogantech a day ago

      Firewalls exist; many network environments block everything not explicitly allowed.

      Authentication is only part of the problem: networks are firewalled (with dedicated appliances) and segmented to prevent lateral movement in the event of a compromise.

      • klysm a day ago

        Isn’t that completely orthogonal? IP addresses aren’t authenticated, they can be spoofed

        • bogantech a day ago

          It's not authentication. People aren't using static IPs for authentication purposes.

          But if I have firewall policies that allow connections only to specific services, I need a destination address and port (yes, some firewalls allow hostnames, but there are drawbacks to that).

          > IP addresses aren't authenticated, they can be spoofed

          For anything bidirectional you'd need the client to have a route back to you for that address, which would require compromising some routers and advertising it via BGP, etc.

          You can spoof addresses all you want, but it generally won't do much for a stateful protocol.

          • klysm 17 hours ago

            > People aren't using static IPs for authentication purposes

            Unfortunately they are! I've seen IP whitelisting used as the only means of authentication over the WAN several times.

          • otabdeveloper4 a day ago

            > People aren't using static IPs for authentication purposes

            Lol. Of course they do. In fact, it's the only viable way to authenticate servers in Current Year. Unlike SSH host keys, which literally nobody on this planet takes seriously, or HTTPS certificates, which are just make-work security theater.

            • klysm 17 hours ago

              Now this is an interesting take - I can’t tell if you are being serious

ATechGuy a day ago

> Modal has an isolated container runtime that lets us share each host’s CPU and memory between workloads.

Looks like Modal hosts workloads in containers, not VMs. How do you enforce secure isolation with this design? A single kernel vulnerability could lead to remote execution on the host, impacting all workloads. Am I missing anything?

  • ekzhang a day ago

    I mentioned this in another comment thread, but we use gVisor to enforce isolation. https://gvisor.dev/users/

    It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.

    • yegle a day ago

      And Google's own serverless offerings (App Engine, Cloud Run, Cloud Functions) :-)

      Disclaimer: I'm an SRE on the GCP Serverless products.

      • ekzhang a day ago

        Neat, thanks for sharing! Glad to know we're in good company here.

klysm a day ago

Why is it important to have a static outbound ip address?

techn00 a day ago

side question: what do you use to make the diagrams?

stuckkeys a day ago

This is just what I needed. Chef's kiss.

fusjdffddddddds a day ago

It's going to take years for orgs to adopt IPv6 and mTLS+JWT/OIDC.

Even longer for QUIC/H3.

  • klysm a day ago

    I'm not convinced that mTLS or OIDC are good ideas

    • fusjdffddddddds a day ago

      ... Are you going to say why?

      • klysm a day ago

        I could go into detail, but it was stated as if these technologies should unambiguously be adopted

        • PLG88 a day ago

          I am very curious, because I do think they are unambiguously a good thing.

eqvinox a day ago

I guess my first question is, why is this built on IPv4 rather than IPv6...

  • ekzhang a day ago

    Yeah, great question. This came up at the beginning of design. A lot of our customers specifically needed IPv4 whitelisting. For example, MongoDB Atlas (a very popular database vendor) only supports IPv4. https://www.mongodb.com/community/forums/t/does-mongodb-atla...

    The architecture of vprox is pretty generic though and could support IPv6 as well.

    • eqvinox a day ago

      I guess that works until other customers need access to IPv6-only resources… (e.g.: we've stopped rolling IPv4 to any of our CI. No IPv6, no build artifacts…)

      In a perfect world I'd also be asking whether you considered NAT64, but unfortunately I'm well aware that's a giant world of pain to get to work on Linux (involving either out-of-tree Jool, or full-on VPP)

      • ekzhang a day ago

        Yeah, you hit the nail on the head. We considered NAT64 as well and looked at some implementations including eBPF-based ones like Cilium.

        Glad to know that IPv6-only is working well for you. "In a perfect world…" :)

        • eqvinox a day ago

          It is what it is :/ … I do periodically ask these questions to track how v4-vs-v6 things are developing, and they're moving, albeit at a snail's pace.

          (FTR, it works for us because our CI is relatively self-contained. And we have local git mirrors… f***ing github…)

          • lowpro a day ago

            At my company (Fortune 100), we've been selling a lot of our public v4 space to implement... RFC1918 space. We've re-IP'd over 50,000 systems so far to private space. We just implemented NAT for the first time ever. I was surprised to see how far behind some companies are.

            • eqvinox a day ago

              Progress is coming from the weirdest corners… US DoD and NATO require IPv6 feature-parity to IPv4 nowadays, no full IPv6 = no bidding on tenders…

              (I would already have expected this to be quite effective in forcing IPv6, but tbh I'm still surprised just how effective.)

nodesocket a day ago

Couldn't a NAT instance in front of the containers accomplish this as well (assuming it's only needed for outbound traffic)? The open source project fck-nat[1] looks amazing for this purpose.

[1] https://fck-nat.dev/stable/

  • ekzhang a day ago

    Right, vprox servers act as multiplexed NAT instances with a VPN attached. You do still need the VPN part though since our containers run around the world, in multiple regions and availability zones. Setting the gateway to a machine running fck-nat would only work if that machine is in the same subnet (e.g., for AWS, in one availability zone).

    The other features that were hard requirements for us were multi-tenancy and high availability / failover.

    By the way, fck-nat is just a basic shell script that sets the `ip_forward` and `rp_filter` sysctls and adds an IP masquerade rule. If you look at vprox, we also do this but build a lot on top of it. https://github.com/modal-labs/vprox
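
    For comparison, here's roughly that minimal NAT setup expressed in Go rather than shell (the egress interface name is an assumption; vprox layers the WireGuard peering, per-client addressing, and failover on top of this):

    ```go
    package main

    import (
        "log"
        "os"
        "os/exec"
    )

    func main() {
        // Equivalent of `sysctl -w net.ipv4.ip_forward=1`: let the host forward packets.
        if err := os.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte("1"), 0o644); err != nil {
            log.Fatalf("enabling ip_forward: %v", err)
        }

        // Equivalent of `iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE`:
        // rewrite the source address of forwarded packets to this host's address.
        out, err := exec.Command("iptables",
            "-t", "nat", "-A", "POSTROUTING", "-o", "eth0", "-j", "MASQUERADE",
        ).CombinedOutput()
        if err != nil {
            log.Fatalf("iptables: %v: %s", err, out)
        }
    }
    ```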

    • nodesocket a day ago

      Ahh, that makes sense. I do think a single fck-nat instance can serve multiple AZs in an AWS region, though; you just need to adjust the VPC routing table. Thanks for the reply and info.