concourse / prod

bosh/terraform config for our deployments

Shell 31.81% HCL 68.19%

prod's Introduction

This repo contains scripts, terraform files, and BOSH manifests/ops-files that we use for operating our infrastructure, BOSH director, and various deployments on GCP. The ops-files in this repo are used in our automated deployments running on prod Concourse.

If you ever need to modify the prod BOSH director, or any infrastructure that one of the BOSH deployments uses, this is the repo to modify.

prod's People

Contributors

dennisdenuto joshuatcasey birdrock klakin-pivotal isabella232

prod's Issues

continuously deploy bosh-topgun worker in main pipeline

similar to pks worker, tag everything to do with topgun or that worker appropriately.... you should be able to still use the bosh-deployment resource. the manifest for that worker is in concourse/prod, called something like topgun-worker.yml

move Strabo into separate node pool

First attempt: we increased the workers-3 node pool by 5 and scheduled a new Strabo-workers deployment onto that node pool. It didn't work. @cirocosta mentioned that this doesn't actually fix things, since Strabo's resources still aren't isolated from everything else running in that pool.

Second attempt: create a separate node pool, Strabo-workers, with 5 k8s workers. Move the Strabo-workers helm deployment onto it and see if the pipeline passes.
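
A dedicated pool along the lines of the second attempt could be declared in Terraform roughly as follows. This is only a sketch: the cluster name, zone, machine type, and the label/taint pair are assumptions made for illustration, and the taint is one possible way to keep other workloads off the pool rather than something the issue prescribes.

```hcl
# Sketch of a separate GKE node pool for the Strabo workers.
# Cluster name, location, and machine type are hypothetical placeholders.
resource "google_container_node_pool" "strabo_workers" {
  name       = "strabo-workers"
  cluster    = "example-cluster" # hypothetical cluster name
  location   = "us-central1-a"   # assumed zone
  node_count = 5                 # the 5 k8s workers from the issue

  node_config {
    machine_type = "n1-standard-4" # assumed size

    # Label used by a nodeSelector on the helm deployment (assumed key/value).
    labels = {
      workload = "strabo"
    }

    # Taint so that only pods tolerating it land on these nodes (assumed).
    taint {
      key    = "workload"
      value  = "strabo"
      effect = "NO_SCHEDULE"
    }
  }
}
```

With a setup like this, the Strabo-workers helm deployment would also need a matching node selector and toleration in its values, otherwise its pods would not be scheduled onto the new pool.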

logging solution for k8s ci

add a wiki entry for how to search for logs in stackdriver - can it be as easy as papertrail?

move booklit pipeline off wings and update references in docs

https://github.com/concourse/docs/blob/master/lit/docs/observation.lit#L47-L56

migrate `examples` team pipelines to new CI

figure out where those sample pipelines come from, or invent/choose a place for them to live and tell osis. put their existence in code (like the reconfigure pipeline)

make a job that will deploy the same version of Concourse to prod and all its worker pools

in a perfect world, creds live only in vault (or rather, in its properly supported storage backend) and manual deploys use the same technology
there are some helm deployment resource types that might help

set up auto-unseal for vault

auto-unseal
- Understand auto-unseal with Google Cloud Key Management Service (extraEnvVars in values.yaml from the vault-helm chart, and the auto-unseal settings in the vault config file).
- See the sample terraform config for gcpkms and the terraform docs for gcpkms.
- Update terraform to add the gcpkms crypto key and key ring.
- Add the GCP credential to hush-house-values-nci in lastpass.
- Run `vault operator init` to initialize vault.
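
The terraform step in the list above might look something like the sketch below. All resource and key names are hypothetical, the location is an assumption, and the syntax assumes Terraform 0.12+ with a recent google provider (older setups reference the key ring differently).

```hcl
# Sketch: KMS key ring and crypto key for Vault auto-unseal.
# Names and location are placeholders, not values from this repo.
resource "google_kms_key_ring" "vault" {
  name     = "vault-unseal" # hypothetical key ring name
  location = "global"       # assumed; a regional location also works
}

resource "google_kms_crypto_key" "vault_unseal" {
  name     = "vault-unseal-key" # hypothetical key name
  key_ring = google_kms_key_ring.vault.id

  # Destroying this key would make the sealed Vault data unrecoverable,
  # so ask Terraform to refuse to delete it.
  lifecycle {
    prevent_destroy = true
  }
}
```

On the Vault side, the `seal "gcpckms"` stanza in the config file takes `project`, `region`, `key_ring`, and `crypto_key`, which would need to match whatever names are chosen here.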

Look into automated cert rotation (and propagation)

We should take a look at whether cert rotation can be more automated (or documented better) than our current process.

Currently we need to:

- update the certs in credhub for metrics and re-deploy
- add the certs to the lastpass entry for the director vars and run the create script
- ????
- profit

These deploys and credhub interactions could at least be assisted by a pipeline

add well-supported storage backend for vault

please put your configuration in code with terraform

Change Vault backend to something other than filesystem

We used to use an old version of Vault in BOSH prod, which may have used the filesystem backend as a default option. While it technically works, we may want to shift to a less error-prone storage backend, since it holds all of our production secrets. It would be quite annoying to have to recover from someone accidentally running `rm -rf /vault`, for example, while SSH'd into the VM/container.

All of the possible backends are documented here: https://www.vaultproject.io/docs/configuration/storage/index.html

We don't really need highly consistent/replicated/sharded/etc persistence so a lot of these strategies are overkill, but I'd say that the one feature we could benefit from is being able to easily make snapshots for backing up and restoring.

We already use GCS so my instinct would be to just pick that one, but this story includes room for doing some investigation.
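
If GCS does turn out to be the pick, the Vault side of the change is small. A minimal sketch of the storage stanza, assuming a bucket that already exists (the bucket name is a placeholder):

```hcl
# Sketch: Vault server config using the GCS storage backend.
storage "gcs" {
  bucket = "example-prod-vault" # hypothetical bucket name
}
```

The bucket itself could be managed in terraform alongside the rest of the infrastructure, and enabling object versioning on it is one possible way to get closer to the backup/restore story described above.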

An annoying thing we will have to do once is perform a data migration into the new schema. As far as I can tell there is no officially-supported way to migrate data between different backends.
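
Worth checking during the research step: recent Vault releases include an offline `vault operator migrate` command that copies data from one storage backend to another. A sketch of its config file, assuming the current data lives under `/vault` on the filesystem and the destination is a GCS bucket (both values, and the file name, are placeholders):

```hcl
# migrate.hcl (hypothetical file name)
# Source: the existing filesystem backend.
storage_source "file" {
  path = "/vault" # assumed data directory
}

# Destination: the new backend, here GCS.
storage_destination "gcs" {
  bucket = "example-prod-vault" # hypothetical bucket name
}
```

This would be run as `vault operator migrate -config=migrate.hcl` while the Vault server is stopped, after taking the backup listed in the steps below.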

- Research and choose a new backend
- Backup current prod Vault data
- Set up the new backend
- Update operating docs (github.com/pivotal/concourse-ops wiki)
- Restore old secrets into new backend
- Decommission old backend

Remove NATing from BOSH networks

Hey,

We've been recently receiving complaints that resources like docker-image and
registry-image have been failing with "429 Too Many Requests".

While we did introduce retries at the resource-type level for registry-image (see
concourse/registry-image-resource#69), those using
docker-image (or trying to reach dockerhub directly) would still suffer from the
limit being placed on our IP.

My hypothesis is that by removing the NAT machine that we have in the bosh
network (which ends up making every request from any of the 40+ machines we
have go out from that single IP), we can get rid of the problems we're
currently facing with regard to limits on the number of requests (as well as
removing one hop and a single point of failure).

Last week, I naively tried just removing the routes that we have set at the
network level

prod/iaas/bosh.tf

Lines 135 to 153 in 92cf177

    resource "google_compute_route" "internal_nat" {
      name                   = "internal-nat-route"
      dest_range             = "0.0.0.0/0"
      network                = "${google_compute_network.bosh.name}"
      next_hop_instance      = "${google_compute_instance.nat.name}"
      next_hop_instance_zone = "${google_compute_instance.nat.zone}"
      priority               = 800
      tags                   = ["internal"]
    }

    resource "google_compute_route" "vault_nat" {
      name                   = "vault-nat-route"
      dest_range             = "0.0.0.0/0"
      network                = "${google_compute_network.bosh.name}"
      next_hop_instance      = "${google_compute_instance.nat.name}"
      next_hop_instance_zone = "${google_compute_instance.nat.zone}"
      priority               = 800
      tags                   = ["vault"]
    }

but that didn't work as expected, because the machines that we create in the
bosh network are not assigned ephemeral external IPs:

"The instance must have an external IP address. An external IP can be assigned
to an instance when it is created or after it has been created."

(from https://cloud.google.com/vpc/docs/vpc#internet_access_reqs)

prod/bosh/cloud_config.yml

Lines 29 to 36 in 92cf177

    - name: private
      type: dynamic
      subnets:
      - azs: [z1, z2]
        cloud_properties:
          network_name: bosh
          subnetwork_name: internal
          tags: [internal]

Given that we're on GCP, we can overcome that by using the
ephemeral_external_ip property - see https://bosh.io/docs/google-cpi/#networks.

Should we do that? I think so - unless we have a requirement that those
machines be completely unreachable from outside (not really the case for us),
I think we should just drop the NAT.

Thanks!

Move CI (prod) to run on K8s

We would like to migrate the CI (prod) deployment to run on K8s

TODOs v1

TODOs v2

merge concourse/ci#231
merge concourse/hush-house#90
Figure out prod deploys: #45
- windows worker on prod should move its config to ci

PR to concourse/prod:

remove ci-concourse-ci-org-dns from https://github.com/concourse/prod/blob/master/iaas/dns.tf

PR to concourse/ci:

topgun worker - update configuration in concourse/ci with new DNS
k8s-topgun worker - update configuration in ci with new DNS
windows worker - update configuration in ci with new DNS
concourse-pipeline resource inside the reconfigure pipeline
slack notification resource inside the main pipeline

PR to concourse/hush-house:

change concourse-nci-address in https://github.com/concourse/hush-house/blob/master/terraform/main.tf
external url on the new environment needs to change too
change cluster name from ci-house to ci

Secrets:

TLS cert on API/web

manual followup steps:

terraform apply in prod
terraform apply in hush-house
deploy CI and all its worker pools, using the results of #45
github auth redirect url

scale down old prod

makes sense to do at the same time as #39
