Deploying CoreOS on the Cloudhelix Platform

One of our core offerings at Cloudhelix is complex cloud hosting: providing and managing a platform for a customer that is provisioned and load-balanced across multiple providers. CoreOS is a good fit for this model, as a single compute cluster can run across multiple platforms, such as our own VMware private cloud platform, Amazon EC2, or Google Compute Engine.

Following some customer interest, we have uploaded the latest Alpha release of CoreOS (349.0.0) to our vCloud Director catalog for general experimentation 🙂

Summary

CoreOS is a minimal, fast-booting OS that can be used to build massively-scalable, resilient compute clusters. It has three main building blocks:

etcd: etcd is a highly-available, distributed key value store, used for service discovery and storage of configuration values.
docker: the docker engine is used to create containers, which run applications in isolation. A container differs from a virtual machine in that it is not a complete guest OS, just the application and its dependencies.
fleet: fleet is used to deploy containers across your cluster, and is able to provide high-availability based on configured conditions (such as separating containers across machines).

More information: https://coreos.com/using-coreos

Detail

This walkthrough briefly demonstrates:

Building a CoreOS cluster.
Creating a webserver image to be run with docker.
Using fleet to deploy multiple copies of the image across multiple hosts, using metadata and high-availability rules to determine placement.
Adding keys to etcd that record that a webserver is running, along with details of its host. This information could then be read by a loadbalancer to automatically add these servers to its configuration.

Initial boot

By default, the CoreOS VM will try to acquire a DHCP address, so make sure that a DHCP server (configured on your vShield Edge or elsewhere) is available. Once you have deployed the CoreOS VM from the catalog, check the console for its IP address.

You will then be able to SSH to it using a key rather than a password (AWS users will be familiar with this). For testing purposes we have left the default ‘insecure_ssh_key’ on the template, although please note that this should not be used in production. The key can be extracted from: http://alpha.release.core-os.net/amd64-usr/current/coreos_production_vmware_insecure.zip

SSH to the VM: ssh -i insecure_ssh_key core@10.1.1.2

Cloud-config

A config file is required, which can be attached as an ISO or copied to a file once connected via SSH.

When creating an ISO:

mkdir -p /tmp/new-drive/openstack/latest
vi user_data
#insert config
cp user_data /tmp/new-drive/openstack/latest/user_data
mkisofs -R -V config-2 -o configdrive.iso /tmp/new-drive
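If you build config drives for several machines, the steps above can be wrapped in a small function. This is only a sketch: the function name build_config_drive and its arguments are our own invention, it only handles cloud-config files (not scripts), and it assumes mkisofs may or may not be installed on the machine where you run it.

```shell
# Stage a cloud-config file in the config-drive layout and, if mkisofs
# is available, build the ISO. Names and arguments are illustrative.
build_config_drive() {
    user_data="$1"   # path to your cloud-config file
    staging="$2"     # scratch directory, e.g. /tmp/new-drive
    iso="$3"         # output path, e.g. configdrive.iso

    [ -f "$user_data" ] || { echo "missing $user_data" >&2; return 1; }

    # A cloud-config file is expected to start with this exact header.
    if [ "$(head -n 1 "$user_data")" != "#cloud-config" ]; then
        echo "$user_data should start with #cloud-config" >&2
        return 1
    fi

    mkdir -p "$staging/openstack/latest" || return 1
    cp "$user_data" "$staging/openstack/latest/user_data" || return 1

    if command -v mkisofs >/dev/null 2>&1; then
        mkisofs -R -V config-2 -o "$iso" "$staging"
    else
        echo "mkisofs not found; staged files left in $staging" >&2
    fi
}
```

For example: build_config_drive user_data /tmp/new-drive configdrive.iso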

Your desired config can also be copied into /usr/share/oem/cloud-config.yml. In our example we have generated a token from https://discovery.etcd.io/new, which the machines use with the discovery API to discover other peers in a cluster. Once generated you can browse to https://discovery.etcd.io/insert-token-here to inspect the list of peers.

We have also configured a static IP address, and set some metadata about the machine (a disk type of SSD, and Cloudhelix as the Cloud provider) for fleet to use later. We could also configure similar metadata to define a machine running on AWS. Fleet can then use this information to make scheduling decisions for your containers.

#cloud-config
hostname: coreos-n
coreos:
  fleet:
    metadata: provider=cloudhelix,disk=ssd
  etcd:
    discovery: https://discovery.etcd.io/
    addr: 10.1.1.2:4001
    peer-addr: 10.1.1.2:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
    - name: 10-ens3.network
      content: |
        [Match]
        Name=enp
        [Network]
        Address=10.1.1.2/24
        Gateway=10.1.1.254
        DNS=10.1.1.254
ssh_authorized_keys:
  - ssh-rsa AAA #snip

More information: https://coreos.com/docs/cluster-management/setup/cloudinit-cloud-config/
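Only the hostname and the two etcd addresses differ between nodes, so the per-machine file can be generated rather than edited by hand. The sketch below is one way to do that. The function name render_cloud_config is hypothetical, and for brevity it leaves out the static network unit and the SSH keys from the example above, so you would need to add those back for real use.

```shell
# Print a cloud-config for one node.
# Arguments: hostname, IPv4 address, full discovery URL.
render_cloud_config() {
    node_name="$1"
    node_ip="$2"
    discovery_url="$3"
    cat <<EOF
#cloud-config
hostname: ${node_name}
coreos:
  fleet:
    metadata: provider=cloudhelix,disk=ssd
  etcd:
    discovery: ${discovery_url}
    addr: ${node_ip}:4001
    peer-addr: ${node_ip}:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
EOF
}
```

Redirect the output to a file for each machine, for example: render_cloud_config coreos-3 10.1.1.4 "$DISCOVERY_URL" > user_data (where DISCOVERY_URL holds the URL you generated earlier).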

Once three machines have booted with this config, the cluster will be up and running. Check journalctl --boot for any issues. Test etcd as follows:

From machine 1: etcdctl set /message123 Hello

From machine 2: etcdctl get /message123

From machine 3: etcdctl rm /message123
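The three-machine test above is the real check that the cluster is replicating. As a quick local sanity check on a single node, the same set/get/rm sequence can be scripted. This is a sketch only: etcd_roundtrip is a made-up name, and the ETCDCTL variable is our own convention for swapping in a different command.

```shell
# Round-trip a key through etcd on the local node: set, read back, remove.
# ETCDCTL may be overridden; it defaults to plain etcdctl.
etcd_roundtrip() {
    key="$1"
    value="$2"
    ${ETCDCTL:-etcdctl} set "$key" "$value" >/dev/null || return 1
    got=$(${ETCDCTL:-etcdctl} get "$key") || return 1
    ${ETCDCTL:-etcdctl} rm "$key" >/dev/null || return 1
    # Succeed only if what we read matches what we wrote.
    [ "$got" = "$value" ]
}
```

Usage: etcd_roundtrip /message123 Hello && echo "etcd OK"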

Docker

As a basic example, the following command launches an Ubuntu container (pulling Ubuntu from the public registry if it is not yet available locally), echoes "hello world", and then exits.

docker run ubuntu /bin/echo hello world

Next example: running a bash shell from the phusion image (an Ubuntu version made more ‘docker-friendly’).

docker run -t -i phusion/baseimage /bin/bash

Again, once you exit the shell the container stops, and anything you installed exists only in that stopped container. So before exiting, install a service (e.g. apache) and then commit the change:

Install apache: apt-get update && apt-get install apache2

Open another SSH session to the machine and retrieve the container’s ID (nnnnn below): docker ps

Commit the changes: docker commit nnnnn coreos/webserver

Note that ‘coreos’ is a username. If you want to push images to the public registry you will need to register for a proper username: https://hub.docker.com/account/login/

Instead we are going to push this image to a private registry running locally. Setting this up in another container was as easy as follows:

docker run -d -p 5000:5000 registry

Get the image ID (yyyyy below): docker images

Tag: docker tag yyyyy 10.1.1.3:5000/webserver

Push to registry: docker push 10.1.1.3:5000/webserver
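The tag and push steps always go together, so they can be combined. The helper below is a sketch; push_to_registry and the DOCKER override are names we have made up. Setting DOCKER="echo docker" prints the commands instead of running them, which is handy for checking the image reference before pushing.

```shell
# Tag a local image for a private registry and push it.
# Set DOCKER="echo docker" for a dry run that only prints the commands.
push_to_registry() {
    image_id="$1"   # image ID from 'docker images'
    registry="$2"   # host:port of the registry, e.g. 10.1.1.3:5000
    repo="$3"       # repository name, e.g. webserver
    target="${registry}/${repo}"
    ${DOCKER:-docker} tag "$image_id" "$target" || return 1
    ${DOCKER:-docker} push "$target"
}
```

Dry run: DOCKER="echo docker" push_to_registry yyyyy 10.1.1.3:5000 webserver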

Once uploaded, the image can be run from another host in the cluster. Apache is run in the foreground to avoid the container exiting, -d is used so that the CoreOS machine is returned to a prompt, and port 80 is mapped to the container:

docker run --name webserver -d -p 80:80 10.1.1.3:5000/webserver /sbin/my_init --enable-insecure-key --quiet -- /usr/sbin/apache2ctl -D FOREGROUND

Note that the insecure-key has been temporarily enabled; this would be replaced before the container was used in production. You should now be able to browse to the IP address of your CoreOS machine: port 80 will be forwarded to your Ubuntu container and you will see the Ubuntu-branded ‘It works!’ default apache page.
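If you would rather check from the command line than a browser, a small polling loop does the job, since Apache can take a moment to start. This is a sketch: wait_for_http is a hypothetical helper, and the CURL override exists only so a different client can be substituted.

```shell
# Poll a URL until it answers successfully or the attempts run out.
# Arguments: URL, number of attempts (default 10), one second apart.
wait_for_http() {
    url="$1"
    attempts="${2:-10}"
    i=1
    while :; do
        # -f treats HTTP errors as failures; -sS hides progress but keeps errors.
        if ${CURL:-curl} -fsS -o /dev/null "$url"; then
            return 0
        fi
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep 1
    done
}
```

Usage: wait_for_http http://10.1.1.2/ && echo "webserver is answering"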

To SSH to the container:

Download the insecure-key: curl -o insecure_key -fSL https://github.com/phusion/baseimage-docker/raw/master/image/insecure_key && chmod 600 insecure_key

Get the container’s ID (nnnnn below): docker ps

Retrieve the container’s IP address: docker inspect -f "{{ .NetworkSettings.IPAddress }}" nnnnn

SSH: ssh -i insecure_key root@172.17.x.x

The container can be stopped with docker stop nnnnn and removed with docker rm nnnnn. docker ps -a will list inactive containers.

More information

https://coreos.com/docs/launching-containers/building/getting-started-with-docker/

https://github.com/phusion/baseimage-docker

Fleet

Rather than deploying containers manually with docker, they can be controlled at the cluster level using Fleet.

Fleet requires public-key authentication to be used for password-less SSH between cluster nodes.

The key can be created as normal with ssh-keygen and copied to the other nodes.

To avoid a ‘SSH_AUTH_SOCK environment variable is not set’ error, start an agent and load the environment variables it prints into your current shell: eval $(ssh-agent)

Then run ssh-add

The fleetctl-inject-ssh.sh script linked below can be used to automate the key addition to the other nodes.

List the machines: fleetctl list-machines

MACHINE IP METADATA

0a670eff… 10.1.1.2 disk=ssd,provider=cloudhelix

79ae0a20… 10.1.1.3 disk=ssd,provider=cloudhelix

dfc42adb… 10.1.1.4 disk=ssd,provider=cloudhelix

The metadata we set earlier is shown. fleetctl list-machines -l will give the full machine ID, which can then be used to test SSH: fleetctl ssh id.

Sample unit file to run a webserver:

[Unit]
Description=webserver
After=docker.service
Requires=docker.service
 
[Service]
ExecStart=/usr/bin/docker run --rm --name webserver -p 80:80 10.1.1.3:5000/webserver /sbin/my_init --enable-insecure-key --quiet -- /usr/sbin/apache2ctl -D FOREGROUND
ExecStop=/usr/bin/docker stop -t 1 webserver
 
[X-Fleet]
X-Conflicts=webserver.*.service
X-ConditionMachineMetadata=provider=cloudhelix
X-ConditionMachineMetadata=disk=ssd

Note the conditions that state webservers must run on different machines and that also specify metadata. Save this as webserver.1.service. It can then be copied to webserver.2.service with no changes required.

Here is a sample sidekick unit, announce-webserver.2.service, that will announce that the container is running. One is required for each container:

[Unit]
Description=Announce webserver.2
BindsTo=webserver.2.service
 
[Service]
ExecStart=/bin/sh -c "while true; do etcdctl set /services/website/webserver.2 '{ \"host\": \"%H\", \"port\": 80, \"version\": \"insert-image-id\" }' --ttl 60;sleep 45;done"
ExecStop=/usr/bin/etcdctl rm /services/website/webserver.2
 
[X-Fleet]
X-ConditionMachineOf=webserver.2.service

This will add etcd entries to be used for service discovery. Also note the condition that makes sure the sidekick runs on the same machine as the webserver. If the webserver is stopped, the sidekick will also be stopped, which removes the entries from etcd; a loadbalancer can in turn use that change to adjust its configuration.
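Since each webserver needs its own numbered unit and sidekick, creating the copies by hand gets tedious beyond two or three instances. The sketch below generates instances 2 to N from instance 1. It assumes you have saved a first copy named with a .1 suffix (for example announce-webserver.1.service, which is not shown above), and clone_unit is a name we have invented.

```shell
# Create instances 2..N of a numbered unit from instance 1.
# Any literal "webserver.1" inside the file is rewritten to the new
# instance number, which is what the sidekick units need. The webserver
# unit itself contains no such string, so it is copied unchanged.
clone_unit() {
    base="$1"    # e.g. webserver or announce-webserver
    count="$2"   # total number of instances wanted
    i=2
    while [ "$i" -le "$count" ]; do
        sed "s/webserver\.1/webserver.$i/g" "$base.1.service" \
            > "$base.$i.service" || return 1
        i=$((i + 1))
    done
}
```

Usage, from the directory holding the unit files: clone_unit webserver 3 && clone_unit announce-webserver 3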

The final step is to start the containers. Wildcards can be used to start multiple containers at once:

fleetctl start webserver.*

fleetctl start announce-webserver.*

fleetctl list-units

UNIT STATE LOAD ACTIVE SUB DESC MACHINE

announce-webserver.1.service launched loaded active running Announce webserver.1 79ae0a20…/10.1.1.2

announce-webserver.2.service launched loaded active running Announce webserver.2 0a670eff…/10.1.1.4

webserver.1.service launched loaded active running webserver 79ae0a20…/10.1.1.2

webserver.2.service launched loaded active running webserver 0a670eff…/10.1.1.4

As shown, both webservers have been automatically placed on different machines and their service discovery containers have been automatically placed on the same machine as their webserver.

Check the status of a unit: fleetctl status unit-name

Check the logs for a unit: fleetctl journal unit-name (use -f to tail the logs in real time). Using this with one of the service discovery units will periodically show one of its etcd additions.

Check the etcd entries for the webservers: etcdctl ls /services/ --recursive

/services/website

/services/website/webserver.2

/services/website/webserver.1

etcdctl get /services/website/webserver.1

{ "host": "coreos-2", "port": 80, "version": "a5fbb278a609" }

etcdctl get /services/website/webserver.2

{ "host": "coreos-4", "port": 80, "version": "a5fbb278a609" }
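To give an idea of how a loadbalancer might consume these entries, the sketch below turns one record into a backend line. The HAProxy-style "server NAME HOST:PORT check" layout is our own choice for illustration, backend_line is a made-up name, and the sed parsing assumes the exact JSON shape written by the sidekick unit. The loadbalancer would also need to be able to resolve the hostnames.

```shell
# Turn one service record (the JSON stored by the sidekick unit) into a
# backend line. Fails if host or port cannot be found in the record.
backend_line() {
    name="$1"   # e.g. webserver.1
    json="$2"   # e.g. { "host": "coreos-2", "port": 80, ... }
    host=$(printf '%s\n' "$json" | sed -n 's/.*"host": *"\([^"]*\)".*/\1/p')
    port=$(printf '%s\n' "$json" | sed -n 's/.*"port": *\([0-9][0-9]*\).*/\1/p')
    [ -n "$host" ] && [ -n "$port" ] || return 1
    printf '    server %s %s:%s check\n' "$name" "$host" "$port"
}

# Example loop over the live keys (commented out; needs a running cluster):
# for key in $(etcdctl ls /services/website); do
#     backend_line "${key##*/}" "$(etcdctl get "$key")"
# done
```

Because the sidekick sets a 60 second TTL, a webserver that disappears drops out of the listing on its own, so regenerating the backend list periodically keeps it current.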

Destroy the units:

fleetctl destroy webserver.*

fleetctl destroy announce-webserver.*

More information

https://coreos.com/docs/launching-containers/launching/launching-containers-fleet/

http://stackoverflow.com/questions/18880024/start-ssh-agent-on-login

https://github.com/coreos/fleet/blob/master/contrib/fleetctl-inject-ssh.sh

Conclusion

If you made it this far… congratulations! I would suggest checking out some of the CoreOS blogs for further interesting uses. Here are some examples:

http://coreos.com/blog/zero-downtime-frontend-deploys-vulcand/

http://coreos.com/blog/docker-dynamic-ambassador-powered-by-etcd/