
prod's Introduction

This repo contains scripts, terraform files, and BOSH manifests/ops-files that we use for operating our infrastructure, BOSH director, and various deployments on GCP. The ops-files in this repo are used in our automated deployments running on prod Concourse.

If you ever need to modify the prod BOSH director, or any infrastructure that one of the BOSH deployments uses, this is the repo to change.

prod's People

Contributors

birdrock, chenbh, clarafu, dennisdenuto, kcmannem, klakin-pivotal, navdeep-pama, pivotal-bin-ju, pivotal-gabriel-dumitrescu, taylorsilva, topherbullock, vito, xtreme-sameer-vohra, youssb


prod's Issues

continuously deploy bosh-topgun worker in main pipeline

Similar to the pks worker, tag everything to do with topgun or that worker appropriately. You should still be able to use the bosh-deployment resource. The manifest for that worker is in concourse/prod, called something like topgun-worker.yml.

move Strabo into separate node pool

First attempt: we increased the workers-3 node pool by 5 and scheduled a new Strabo-workers deployment onto that node pool. It didn't work. @cirocosta mentioned that this doesn't actually fix things, since Strabo's resources still won't be isolated.

Second attempt: create a separate node pool, Strabo-workers, with 5 k8s workers. Move the Strabo-workers helm deployment onto it and see if the pipeline passes.
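
A minimal sketch of how the second attempt might pin the workers to the new pool through helm values. It assumes the Concourse chart's `worker.nodeSelector` and `worker.tolerations` values and the `cloud.google.com/gke-nodepool` label that GKE sets on nodes. The pool name, the replica count, and the taint key/value are placeholders for illustration, not values taken from this repo.

```yaml
# values override for the Strabo-workers helm deployment (hypothetical file)
worker:
  replicas: 5
  # Schedule worker pods only onto nodes in the dedicated pool.
  # Assumption: the pool is named "strabo-workers".
  nodeSelector:
    cloud.google.com/gke-nodepool: strabo-workers
  # Assumption: the pool carries a matching NoSchedule taint so that
  # other workloads stay off it; key and value here are made up.
  tolerations:
  - key: workers
    operator: Equal
    value: strabo
    effect: NoSchedule
```

The node selector keeps Strabo's workers on the pool, while the taint (if one is set on the pool) is what would keep everything else off it, which is the isolation the first attempt lacked.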

Look into automated cert rotation (and propagation)

We should take a look at whether cert rotation can be more automated (or documented better) than our current process.

Currently we need to:

  • update the certs in credhub for metrics and re-deploy
  • add the certs to the lastpass entry for the directory vars and run the create script
  • ????
  • profit

These deploys and credhub interactions could at least be assisted by a pipeline

Change Vault backend to something other than filesystem

We used to use an old version of Vault in BOSH prod, which may have used the filesystem backend as a default option. While it technically works, we may want to shift to a less error-prone storage backend: it holds all of our production secrets, after all, and it would be quite annoying to have to recover from someone accidentally running rm -rf /vault, for example, while SSH'd into the VM/container.

All of the possible backends are documented here: https://www.vaultproject.io/docs/configuration/storage/index.html

We don't really need highly consistent/replicated/sharded/etc persistence so a lot of these strategies are overkill, but I'd say that the one feature we could benefit from is being able to easily make snapshots for backing up and restoring.

We already use GCS so my instinct would be to just pick that one, but this story includes room for doing some investigation.

An annoying thing we will have to do once is perform a data migration into the new schema. As far as I can tell there is no officially-supported way to migrate data between different backends.

  • Research and choose a new backend
  • Backup current prod Vault data
  • Set up the new backend
  • Update operating docs (github.com/pivotal/concourse-ops wiki)
  • Restore old secrets into new backend
  • Decommission old backend

Remove NATing from BOSH networks

Hey,

We've been recently receiving complaints that resources like docker-image and
registry-image have been failing with "429 Too Many Requests".

While we did introduce retries at the resource-type level for registry-image
(see concourse/registry-image-resource#69), those using docker-image (or trying
to reach dockerhub directly) would still suffer from the limit being placed on
our IP.

My hypothesis is that by removing the NAT machine that we have in the bosh
network (which ends up making every request from any of the 40+ machines we
have go out from that single IP), we can get rid of the problems we're
currently facing with regard to limits on the number of requests (aside from
removing one hop and a single point of failure).

Last week, I naively tried just removing the routes that we have set at the
network level

prod/iaas/bosh.tf

Lines 135 to 153 in 92cf177

resource "google_compute_route" "internal_nat" {
  name                   = "internal-nat-route"
  dest_range             = "0.0.0.0/0"
  network                = "${google_compute_network.bosh.name}"
  next_hop_instance      = "${google_compute_instance.nat.name}"
  next_hop_instance_zone = "${google_compute_instance.nat.zone}"
  priority               = 800
  tags                   = ["internal"]
}

resource "google_compute_route" "vault_nat" {
  name                   = "vault-nat-route"
  dest_range             = "0.0.0.0/0"
  network                = "${google_compute_network.bosh.name}"
  next_hop_instance      = "${google_compute_instance.nat.name}"
  next_hop_instance_zone = "${google_compute_instance.nat.zone}"
  priority               = 800
  tags                   = ["vault"]
}

but that didn't work as expected, because the machines that we create in the
bosh network are not assigned ephemeral external IPs:

"The instance must have an external IP address. An external IP can be assigned
to an instance when it is created or after it has been created."

(from https://cloud.google.com/vpc/docs/vpc#internet_access_reqs)

- name: private
  type: dynamic
  subnets:
  - azs: [z1, z2]
    cloud_properties:
      network_name: bosh
      subnetwork_name: internal
      tags: [internal]

Given that we're on GCP, we can overcome that by using the
ephemeral_external_ip property - see https://bosh.io/docs/google-cpi/#networks.
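
As a rough sketch of what that change could look like, the `private` network above might gain the `ephemeral_external_ip` flag in its `cloud_properties`. This assumes the property is set per subnet as described in the Google CPI docs linked above. Every other value is copied from the existing network definition, and nothing here has been tested against prod.

```yaml
- name: private
  type: dynamic
  subnets:
  - azs: [z1, z2]
    cloud_properties:
      network_name: bosh
      subnetwork_name: internal
      # Assumption: asks the Google CPI to give each VM on this network
      # an ephemeral external IP, so egress no longer depends on the NAT route.
      ephemeral_external_ip: true
      tags: [internal]
```

Existing VMs would presumably only pick this up once the cloud config is updated and the deployments are recreated, and the firewall rules for the `internal` tag would be worth reviewing since the machines would then have public addresses.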

Should we do that? I think so. Unless we have a requirement that those machines
be completely unreachable (not really the case for us), I think we should just
drop the NAT.

Thanks!

Move CI (prod) to run on K8s

We would like to migrate the CI (prod) deployment to run on K8s.

TODOs v1

  • Create address & DNS entry for nci.concourse-ci.org
  • Create a worker node pool
  • Create a CloudSQL instance for CI
  • Helm deploy workers
  • Helm deploy web nodes
  • Helm deploy vault
  • Move secrets over
  • Setup auth for the vault server
  • Ensure Vault is secure (Prod Deployment checklist)
  • Migrate the pipelines ( choose which ones need to be migrated )
  • Migrate example pipelines used in https://concourse-ci.org/examples.html #38
  • Audit & reconcile Bosh manifest values vs. K8s Chart values
  • Move over DNS entry from nci to ci making it official #39
  • Logging solution #40
  • add X-Frame-Option #41
  • Unmute datadog Hush-House SLI alert #42
  • have baggageclaim pipeline setup

TODOs v2

move over DNS entry for `ci.concourse-ci.org`

Tentative plans:

Pre-work:

PR to concourse/prod:

PR to concourse/ci:

  • topgun worker - update configuration in concourse/ci with new DNS
  • k8s-topgun worker - update configuration in ci with new DNS
  • windows worker - update configuration in ci with new DNS
  • concourse-pipeline resource inside the reconfigure pipeline
  • slack notification resource inside the main pipeline

PR to concourse/hush-house:

Secrets:

  • TLS cert on API/web

Manual follow-up steps:

  • terraform apply in prod
  • terraform apply in hush-house
  • deploy CI and all its worker pools, using the results of #45
  • github auth redirect url
