eggrice.soy

How we achieved Cloud-agnostic infrastructure

2021-07-17

Author // Nocoto Day
Status // Final

Goal

This blog post explains how we achieved distributed, reliable, scalable, and Cloud-agnostic infrastructure.

Note: using this infrastructure setup does not automagically make your application reliable. You should still follow fundamental state management and distributed computing principles (not discussed in detail in this post).

Summary

Why Cloud-agnostic

The fundamental requirement of being Cloud-agnostic is being able to operate with any public Cloud provider.

Why would you consider Cloud-agnostic infrastructure from the start?

What are the caveats?

What are some other things to note?

Terraform to manage Clouds

To be Cloud-agnostic, you need a common configuration that applies to multiple Clouds - i.e. you want to stop using the web UI to configure your VMs, etc. Terraform allows us to do this.

You will write a bunch of HCL files (imagine JSON with a few built-in functions) to configure your Clouds. In the following example, let's use Linode and Vultr to turn up our VMs:

main.tf

terraform {
  required_providers {
    linode = {
      source = "linode/linode"
    }
    vultr = {
      source = "vultr/vultr"
    }
  }
}

provider "vultr" {
  # export VULTR_API_KEY=key
}

provider "linode" {
  # export LINODE_TOKEN=token
}

This simply defines which providers are involved. Instead of checking in your secret keys, please set them as environment variables on your local machine.
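If you want reproducible runs across machines, you can also pin provider versions inside required_providers. A small sketch - the version constraints below are arbitrary examples, not recommendations:

terraform {
  required_providers {
    linode = {
      source  = "linode/linode"
      version = "~> 2.0" # accept any 2.x release
    }
    vultr = {
      source  = "vultr/vultr"
      version = "~> 2.0" # accept any 2.x release
    }
  }
}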

In addition, we have a separate development domain that is managed via Linode (so that we don't have to pass around or hardcode IP addresses). You can use any DNS provider to achieve this, e.g. Cloudflare:

domains.tf

resource "linode_domain" "eggricesoy" {
    type = "master"
    domain = "your_domain"
    soa_email = "[email protected]"
}

You can now actually define VMs and associated domain names for each. For example:

dev.tf

resource "linode_instance" "nomad-dev-COUNTRY_CODE-CITY_CODE0" {
  label          = "nomad-dev-COUNTRY_CODE-CITY_CODE0"
  image          = "linode/debian10"
  region         = "your_region"
  type           = "your_machine_type"
}

resource "linode_domain_record" "nomad-dev-COUNTRY_CODE-CITY_CODE0" {
  domain_id = linode_domain.eggricesoy.id
  name = "dev-COUNTRY_CODE-CITY_CODE0"
  record_type = "A"
  target = linode_instance.nomad-dev-COUNTRY_CODE-CITY_CODE0.ip_address
}

resource "vultr_instance" "nomad-dev-COUNTRY_CODE-CITY_CODE1" {
  plan = "your_machine_type"
  region = "your_region"
  # Debian 10
  os_id = "352"
}

resource "linode_domain_record" "nomad-dev-COUNTRY_CODE-CITY_CODE1" {
  domain_id = linode_domain.eggricesoy.id
  name = "dev-COUNTRY_CODE-CITY_CODE1"
  record_type = "A"
  target = vultr_instance.nomad-dev-COUNTRY_CODE-CITY_CODE1.main_ip
}
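If you would rather keep DNS in Cloudflare, the record resources translate roughly like this. This is only a sketch assuming the cloudflare/cloudflare provider; the zone_id is a placeholder and attribute names differ slightly between provider versions:

resource "cloudflare_record" "nomad-dev-COUNTRY_CODE-CITY_CODE0" {
  # ID of a zone already managed in Cloudflare (placeholder).
  zone_id = "your_zone_id"
  name    = "dev-COUNTRY_CODE-CITY_CODE0"
  type    = "A"
  value   = linode_instance.nomad-dev-COUNTRY_CODE-CITY_CODE0.ip_address
}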

Documentation for each provider and module can be found at registry.terraform.io. It is mostly well written and easy to find.

To push infrastructure intent to reality, you run:

$ terraform apply

If you want to take it further, you may want to define start-up scripts for each Cloud provider to prepare your VMs. I have a script that installs Docker and Nomad, then schedules Nomad to start on boot. However, I just run ssh root@domain "bash -s" < script.
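If you prefer to keep that bootstrap step inside Terraform, a rough sketch using Terraform's built-in remote-exec provisioner could look like the following. The script path and SSH key path are placeholders, and this is just one possible approach rather than what I actually run:

resource "linode_instance" "nomad-dev-COUNTRY_CODE-CITY_CODE0" {
  # ... same arguments as in dev.tf above ...

  provisioner "remote-exec" {
    # Local script copied to the fresh VM and executed there
    # (placeholder path - e.g. a script that installs Docker and Nomad).
    script = "scripts/prepare-vm.sh"

    connection {
      type        = "ssh"
      user        = "root"
      host        = self.ip_address
      private_key = file("~/.ssh/id_rsa") # placeholder key path
    }
  }
}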

Docker containers for release

Containerise your releases and promote the containers through a scheduled release process.

When you containerise, you want to embed the flags together with the binary. The huge benefit of doing this is easy rollbacks when things go wrong. Flags appear and disappear as source code evolves, and only certain individuals on your team may know the correct set of values (e.g. optimization flags). You want to version your flag changes together with the code, at all times.

Each release stage should have continuous tests and integration checks to verify a successful release, for example only promoting a container from dev to staging to production once the checks of the previous stage pass.

Nomad to run containers

Nomad is like Kubernetes but with actually readable and discoverable tutorials and reference docs. Nomad manages and runs your binaries for you, according to the job configuration.

I personally like Nomad better because it is simpler to set up. I can't be bothered configuring and setting up Kube myself, so I usually rely on the Cloud provider to configure the Kube cluster for me. This means one cluster = one Cloud provider, hence if that Cloud provider explodes, my cluster is lost.

With Nomad, you can mix and match providers across various regions. If set up correctly, your cluster will not go down even if one of the Cloud providers explodes.

In Nomad, you have "servers" that do job orchestration and "clients" that actually run Docker containers. There are a couple of things to keep in mind when setting up a Nomad cluster: run an odd number of servers (three or five) so they can keep quorum, and spread both servers and clients across providers and regions so that losing one provider does not take the cluster down.
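As a minimal sketch of the two roles (paths, datacenter name and the server address are placeholders), the agent configuration looks roughly like this:

# server.hcl - runs on the machines doing job orchestration
datacenter = "CITY_CODE"
data_dir   = "/opt/nomad"

server {
  enabled          = true
  bootstrap_expect = 3 # wait for three servers before electing a leader
}

# client.hcl - runs on the machines that execute Docker containers
datacenter = "CITY_CODE"
data_dir   = "/opt/nomad"

client {
  enabled = true
  servers = ["dev-COUNTRY_CODE-CITY_CODE0.your_domain"] # wherever your servers run
}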

Configuring jobs in Nomad uses the same language as Terraform (HCL). For example, to run nginx with no rollout policy and in the rawest form possible:

job "nginx" {
  region = "REGION"
  datacenters = ["CITY_CODE"]

  group "nginx" {
    count = 1

    network {
      port "http" {
        to = 80
      }
    }

    service {
      name = "nginx"
      port = "http"
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx"
        ports = ["http"]
      }
    }
  }
}
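If you do want a rollout policy, Nomad's update stanza gives you rolling deploys and automatic rollbacks. A small sketch with arbitrary values, added inside the job or group stanza:

update {
  max_parallel     = 1     # replace one allocation at a time
  min_healthy_time = "30s" # an allocation must stay healthy this long
  healthy_deadline = "5m"  # give up on an allocation after this long
  auto_revert      = true  # roll back to the last good version on failure
}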

Other things to keep in mind

As mentioned in the beginning, simply using this infrastructure does not make your service more reliable or scalable. However, it does give you a strong foundation to run a stable service without blowing your budget on hiring SREs.

After you've set things up following this blog post, you should look into how binary rollouts should happen across the globe and how to manage service-level states.