Ryan Sandor Richards edited this page Dec 4, 2015 · 7 revisions

At ease, soldier! I assume you're here because this is your first week on Point Guard rotation. Point Guards at Runnable have two primary duties:

  1. Monitor & Curate Rollbar Errors
  2. Keep Staging Running Smoothly

If that sounds like more than you can handle, don't fret! This guide is here to lead you on your journey from Runnable point guard private through first lieutenant and beyond!

REMEMBER: If you are missing a piece of information, you can always ask your neighbor, and if you find a way to improve this document, you have the power to do it! By working together we can make sure that being a point guard is an honor, and not a foo-barred pain in the keester!

First Duty: Rollbar

All of the major Runnable services report errors via Rollbar. This gives us a detailed view of where the failure cases are in the infrastructure and whether they are a problem. Because we report all errors, the reports can get noisy and may hide a real problem. Thus, the first duty of a point guard is to manage and curate what we are reporting.

In order to perform this duty you will need to do the following:

  1. Get access to Runnable's production Rollbar; ask Forrest for an account.
  2. Read the Rollbar Handbook; it contains everything you need to know about managing and curating errors in Rollbar.
  3. Ask more seasoned engineers for answers when you are unsure how to proceed while cleaning up errors.

If you follow the steps above, by the end of your first week on rotation you'll no longer be the rookie. You'll be a stone-cold S-class Rollbar error curator. Happy hunting, soldier.

Second Duty: Staging

The Runnable staging environment is our own sandbox running our software on our product. If this seems like a lot of "ours", then you've picked up on the primary reason it is so important:

Dog-fooding: The Runnable staging sandbox keeps us connected to our product and ensures it is meeting the needs of a real development team (namely ours). In other words: staging is a way for us to eat our own dog food.

As such, it is important that the staging environment always be available for use during testing and development. This leads us to the second duty of the point guard, which is threefold:

  • Monitor the staging environment and ensure it is operating smoothly
  • Report any functional issues immediately
  • Rotate and clean up external components (such as staging docks)

In this section we will cover how to go about handling each of the tasks listed above and a general day-to-day process to help ensure you never miss a beat.

Infrastructure and Setup

  • TODO Add conceptual diagram of the current runnable staging environment
  • TODO add introduction

alpha-stage-data

The docks cannot be run within the sandbox, but they require consistent access to specific data stores that can. Unfortunately, because sandboxed services run on docks, which by our definition are ephemeral, we cannot currently ensure that those services will retain their data and IP addresses over the long term.

To remedy this situation we have two auxiliary EC2 instances called alpha-stage-data and alpha-stage-data-2. These instances run the following services:

  1. consul and vault - dock-init uses these; much easier to set up and maintain via Ansible at this time
  2. redis - data store for sauron, dock service, requires TCP
  3. rabbitmq - docker-listener etc., IP cannot switch
  4. swarm-manager - This will probably be OK to push back into the sandbox soon, but keeping it outside until things stabilize (easier to debug, etc.)

The rest of this section gives specific instructions on how to set up the data instances from scratch via Ansible.
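
Before running any of the playbooks in this section, it can help to confirm you are standing in the right directory. The following is a minimal sketch and not part of devops-scripts: check_ansible_dir is a hypothetical helper name, and it assumes the playbooks named in this section sit directly in devops-scripts/ansible next to the stage-hosts/ inventory directory.

```shell
# Hypothetical pre-flight check (not part of devops-scripts).
# Assumes the playbooks named in this guide live directly in the given
# directory, alongside the stage-hosts/ inventory directory.
check_ansible_dir() {
  dir="${1:-.}"
  missing=0

  if [ ! -d "$dir/stage-hosts" ]; then
    echo "missing inventory directory: $dir/stage-hosts" >&2
    missing=1
  fi

  for playbook in consul.yml consul-values.yml vault.yml vault-values.yml \
                  redis.yml swarm-manager.yml; do
    if [ ! -f "$dir/$playbook" ]; then
      echo "missing playbook: $dir/$playbook" >&2
      missing=1
    fi
  done

  # Returns 0 when everything was found, 1 otherwise.
  return "$missing"
}
```

For example, running check_ansible_dir . from your local devops-scripts/ansible directory should print nothing and return success if the layout matches the assumptions above.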

TODO: clean this section up

Consul & Vault
  1. From your local devops-scripts/ansible directory, run the following command: ansible-playbook -i stage-hosts/ consul.yml

  2. The initial runs of this playbook will fail because docker is not yet fully installed on the alpha-stage-data* instances. To proceed, run the playbook until it fails on alpha-stage-data, then restart that instance in EC2 and re-run the command with -e restart=true.

  3. Now that consul is installed, you will need to seed it with the data needed by dock-init. To do so, run: ansible-playbook -i stage-hosts/ consul-values.yml -e write_values=yes

  4. Next we must install vault. To do so, run: ansible-playbook -i stage-hosts/ vault.yml

  5. ssh alpha-stage-data

  6. sudo docker exec -it $(sudo docker ps | grep 'vault' | awk '{print $1}') sh

  7. vault init -address http://127.0.0.1:8200

  8. Record the output from the init command and set the appropriate variables in devops-scripts/ansible/stage-hosts/varibles

  9. ansible-playbook -i stage-hosts/ vault-values.yml -e write_root_creds=yes

  10. ansible-playbook -i stage-hosts/ vault-values.yml -e write_values=yes
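
Step 6 above picks the vault container by grepping the output of docker ps. If more than one row happens to mention vault, or none does, docker exec receives the wrong number of IDs. The sketch below is one way to guard against that. vault_container_id is a hypothetical helper name, and it reads docker ps output on stdin so it can be tried without Docker installed.

```shell
# Hypothetical helper (not part of devops-scripts).
# Reads `docker ps` output on stdin and prints the ID of the single
# container whose row mentions "vault". Fails if there are zero or several.
vault_container_id() {
  # First column of every row that mentions "vault".
  ids=$(awk '/vault/ {print $1}')
  # Count the non-empty lines in the result.
  count=$(printf '%s\n' "$ids" | awk 'NF {n++} END {print n+0}')

  if [ "$count" -ne 1 ]; then
    echo "expected exactly one vault container, found $count" >&2
    return 1
  fi

  printf '%s\n' "$ids"
}
```

On alpha-stage-data this could be used as id=$(sudo docker ps | vault_container_id) followed by sudo docker exec -it "$id" sh, which should behave like step 6 when exactly one vault container is running.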

Redis & RabbitMQ
  1. Run ansible-playbook -i stage-hosts/ redis.yml. This playbook includes the role for installing docker on the alpha-stage-data host. The first run of the playbook will intentionally fail, as the first pass needs to install packages that require a system reboot before actually installing docker and, eventually, creating a redis container.

  2. In AWS reboot the alpha-stage-data host manually.

  3. Once alpha-stage-data has rebooted run the following: ansible-playbook -i stage-hosts/ redis.yml -e restart=true. The -e restart=true indicates to the docker role that we are ready to install docker after a reboot, and the playbook should run successfully.
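
The run, reboot, re-run pattern in the steps above can be sketched as two small shell functions. This is an illustration under assumptions, not an existing script: run, first_pass, second_pass and the DRY_RUN variable are hypothetical names, and the reboot between the two passes is still done by hand in AWS. With DRY_RUN=1 (the default in this sketch) the commands are only printed.

```shell
# Hypothetical wrappers (not part of devops-scripts). Run them from your
# local devops-scripts/ansible directory.
# DRY_RUN=1 prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# First pass: installs the packages that need a reboot. A failure here is
# expected, so it is reported on stderr but not treated as fatal.
first_pass() {
  run ansible-playbook -i stage-hosts/ "$1" ||
    echo "first pass of $1 failed (expected before the reboot)" >&2
}

# Second pass: only run this after the host has been rebooted.
second_pass() {
  run ansible-playbook -i stage-hosts/ "$1" -e restart=true
}
```

For redis this would be first_pass redis.yml, then the manual reboot of alpha-stage-data, then second_pass redis.yml. Set DRY_RUN=0 only once the printed commands look right.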

Swarm Manager
  1. One and done: ansible-playbook -i stage-hosts/ swarm-manager.yml
