buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC

Home Page: https://buildkite.com/docs/quickstart/elastic-ci-stack-aws

License: MIT License

Shell 58.51% Makefile 6.54% PowerShell 18.71% Dockerfile 0.37% Go 11.32% HCL 4.55%
buildkite docker continuous-integration continuous-deployment continuous-delivery aws buildkite-agent-orchestration

elastic-ci-stack-for-aws's Introduction

Elastic CI Stack for AWS

Buildkite is a platform for running fast, secure, and scalable continuous integration pipelines on your own infrastructure.

The Buildkite Elastic CI Stack for AWS gives you a private, autoscaling Buildkite Agent cluster. Use it to parallelize large test suites across thousands of nodes, run tests and deployments for Linux or Windows based services and apps, or run AWS ops tasks.

Getting started

See the Elastic CI Stack for AWS tutorial for a step-by-step guide, the Elastic CI Stack for AWS documentation, or the full list of recommended resources for detailed information.

Or jump straight in:

Launch AWS Stack

The current release is listed on the Releases page, along with older releases.

Although the stack creates its own VPC by default, we highly recommend following best practice by setting up a separate development AWS account and using role switching and consolidated billing — see the Delegate Access Across AWS Accounts tutorial for more information.

If you want to use the AWS CLI, download config.json.example, rename it to config.json, and then run the following command:

aws cloudformation create-stack \
  --output text \
  --stack-name buildkite \
  --template-url "https://s3.amazonaws.com/buildkite-aws-stack/latest/aws-stack.yml" \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --parameters "$(cat config.json)"

Supported Features

Most features are supported on both Linux and Windows; a small number are specific to one operating system (for example, RDP access applies only to Windows). The stack provides:

  • Docker
  • Docker Compose
  • AWS CLI
  • S3 Secrets Bucket
  • ECR Login
  • Docker Login
  • CloudWatch Logs Agent
  • Per-Instance Bootstrap Script
  • SSM Access
  • Instance Storage (NVMe)
  • SSH Access
  • Periodic authorized_keys Refresh
  • Periodic Instance Health Check
  • git lfs
  • Additional sudo Permissions
  • RDP Access

Security

This repository hasn't been reviewed by security researchers, so exercise caution and think carefully about which credentials you make available to your builds.

Anyone with commit access to your codebase (including third-party pull-requests if you've enabled them in Buildkite) will have access to your secrets bucket files.

Also keep in mind that the EC2 instance metadata service is reachable from within builds, which means builds run with the same IAM permissions as the instance.
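For example, any build step can read the instance role's temporary credentials directly from the metadata endpoint, so treat the instance profile's permissions as available to every build:

# Discover the instance role name, then fetch its temporary credentials (IMDSv1 shown for brevity)
ROLE_NAME="$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)"
curl -s "http://169.254.169.254/latest/meta-data/iam/security-credentials/${ROLE_NAME}"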

Limiting CloudFormation Permissions

By default, CloudFormation will operate using the permissions granted to the identity of the credentials used to initiate a stack deployment or update.

If you want to explicitly specify which actions CloudFormation can perform on your behalf, you can either create your stack using credentials for an IAM identity with limited permissions, or provide an AWS CloudFormation service role.

The 🧑‍🔬 templates/service-role.yml template contains an experimental service role and a set of IAM policies listing the IAM actions necessary to create, update, and delete a CloudFormation stack created from the Buildkite Elastic CI Stack template. The role created by this template is still being tested and should not yet be depended on; some stack parameter permutations are likely to require permissions it doesn't include.

aws cloudformation deploy \
  --template-file templates/service-role.yml \
  --stack-name buildkite-elastic-ci-stack-service-role \
  --region us-east-1 \
  --capabilities CAPABILITY_IAM
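Once that stack exists, you can pass the resulting role to CloudFormation when creating or updating the Elastic CI Stack. A minimal sketch, assuming the service-role stack exposes the role ARN as its only output (the actual output name isn't shown here):

# Look up the service role ARN from the service-role stack's outputs
ROLE_ARN="$(aws cloudformation describe-stacks \
  --stack-name buildkite-elastic-ci-stack-service-role \
  --query 'Stacks[0].Outputs[0].OutputValue' --output text)"

# Create the Elastic CI Stack using that role's permissions instead of your own
aws cloudformation create-stack \
  --stack-name buildkite \
  --template-url "https://s3.amazonaws.com/buildkite-aws-stack/latest/aws-stack.yml" \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --role-arn "$ROLE_ARN" \
  --parameters "$(cat config.json)"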

Development

To get started with customizing your own stack, or contributing fixes and features:

# Checkout all submodules
git submodule update --init --recursive

# Build all AMIs and render a CloudFormation template - this requires AWS credentials (in the ENV)
# to build an AMI with packer
make build

# To create a new stack on AWS using the local template
make create-stack

# You can use any of the AWS* environment variables that the aws-cli supports
AWS_PROFILE="some-profile" make create-stack

# You can also use aws-vault or similar
aws-vault exec some-profile -- make create-stack

If you need to build your own AMI (because you've changed something in the packer directory), run packer with AWS credentials in your shell environment:

make packer

This will boot and image three AWS EC2 instances in your account’s us-east-1 default VPC:

  • Linux (64-bit x86)
  • Linux (64-bit Arm)
  • Windows (64-bit x86)

Support Policy

We provide support for security and bug fixes on the current major release only.

If there are any changes in the main branch since the last tagged release, we aim to publish a new tagged release of this template at the end of each month.

AWS Regions

We support all AWS Regions, except China and US GovCloud.

We aim to support new regions within one month of general availability.

Operating Systems

We build and deploy the following AMIs to all our supported regions:

  • Amazon Linux 2023 (64-bit x86)
  • Amazon Linux 2023 (64-bit Arm)
  • Windows Server 2019 (64-bit x86)

Buildkite Agent

The Elastic CI Stack template published from the main branch tracks the latest Buildkite Agent release.

You may wish to preview any updates to your stack from this template using a CloudFormation Stack Change Set to decide whether to apply it.
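A rough sketch of previewing an update with the AWS CLI; the stack name and parameters file follow the earlier example:

# Create a change set against the running stack using the latest published template
aws cloudformation create-change-set \
  --stack-name buildkite \
  --change-set-name elastic-ci-stack-preview \
  --template-url "https://s3.amazonaws.com/buildkite-aws-stack/latest/aws-stack.yml" \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --parameters "$(cat config.json)"

# Inspect the proposed resource changes, then execute or delete the change set
aws cloudformation describe-change-set \
  --stack-name buildkite \
  --change-set-name elastic-ci-stack-preview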

Recommended reading

To gain a better understanding of how the Elastic CI Stack works and how to use it most effectively and securely, see the resources linked from the Elastic CI Stack for AWS documentation.

Questions and support

Feel free to drop an email to [email protected] with questions. It helps us if you can provide the following details:

# List your stack parameters
aws cloudformation describe-stacks --stack-name MY_STACK_NAME \
  --query 'Stacks[].Parameters[].[ParameterKey,ParameterValue]' --output table

Collect logs from CloudWatch

Provide us with logs from CloudWatch Logs:

/buildkite/elastic-stack/{instance-id}
/buildkite/system/{instance-id}
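For example, with AWS CLI v2 you can pull recent events from those log groups (substitute your real instance ID):

aws logs tail "/buildkite/elastic-stack/i-0123456789abcdef0" --since 1h
aws logs tail "/buildkite/system/i-0123456789abcdef0" --since 1h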

Collect logs via script

An alternative method to collect the logs is to use the log-collector script in the utils folder. The script will collect CloudWatch logs for the Instance, Lambda function, and AutoScaling activity and package them in a zip archive which you can send via email to [email protected].

You can also visit our Forum and post a question in the Elastic CI Stack for AWS section!

Licence

See Licence.md (MIT)

elastic-ci-stack-for-aws's Issues

Cannot use docker and multiple buildkite-agents

Hi,

I have discovered that it is currently impossible to reliably run multiple buildkite-agents with Docker on the same machine. If a couple of tasks start shortly after each other, the following lines in the Buildkite "environment" hook cause running Docker containers (from earlier tasks) to be killed. In the build output this shows up as failed tasks with exit status 137.

if [[ $(docker ps -q | wc -l) -gt 0 ]] ; then
  echo "~~~ Stopping existing :docker: containers"
  docker stop $(docker ps -q)
fi

I guess solutions would be to either rely on build-scripts to clean up after themselves (i.e. no cleanup in the environment hook at all), or to scope the killing to the specific buildkite-agent somehow...
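A minimal sketch of the second option, assuming builds label their containers with the agent's name (the com.buildkite.agent label here is purely illustrative, not something the stack sets today):

# Only stop containers that this agent started, identified by a label set at docker run time
containers="$(docker ps -q --filter "label=com.buildkite.agent=${BUILDKITE_AGENT_NAME}")"
if [[ -n "$containers" ]] ; then
  echo "~~~ Stopping this agent's existing :docker: containers"
  docker stop $containers
fi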

Adding a password-less SSH key fails

I have put a passwordless key in my secrets bucket. The file is correctly downloaded from S3, but adding the key fails. Here's my build log.

Agent pid 2946
2016-08-04 08:01:33    1.6 KiB private_ssh_key
Downloading ssh key from s3://ss-bk-elastic-secrets-us-east-1/private_ssh_key
Enter passphrase for /dev/fd/63:

It hangs waiting for standard in.

Set KMS Download of Secrets to be the default approach

Now that we have KMS integration in the Buildkite agents for secret management, I would like that to be the default way of copying from the S3 secrets bucket, instead of having to specify the environment variable BUILDKITE_USE_KMS.

This implies that by default (without having to configure pipelines) Buildkite agents would attempt to decrypt the secrets from S3 using KMS. We could still allow the use of PASSPHRASE for encryption with the same approach as now (if the env variable exists, then use it).

Happy to create the pull request if you would like.

Intermittent docker error "could not find an available predefined network"

I see these on builds occasionally:

$ docker-compose -f docker-compose.yml -p buildkite9c0e505a3f5844e4bde33ee36c928627 run node pug-lint app/assets/javascripts
Creating network "buildkite9c0e505a3f5844e4bde33ee36c928627_default" with the default driver
ERROR: failed to parse pool request for address space "LocalDefault" pool "" subpool "": could not find an available predefined network

Here's the Dockerfile I'm using:

FROM node:4.5.0

RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

COPY package.json /usr/src/app/

RUN npm install
RUN npm install -g eslint babel-eslint pug-lint

COPY . /usr/src/app

and the pipeline.yml:

steps:
  - name: ":docker:"
    plugins:
      docker-compose:
        build: node

  - wait

  - name: ":eslint:"
    command: "eslint --cache app/assets/javascripts spec/javascripts"
    plugins:
      docker-compose:
        run: node

  - name: ":dog:"
    command: "pug-lint app/assets/javascripts"
    plugins:
      docker-compose:
        run: node

Agent instances have public IPs

It's not a major issue, but since I deploy my BK agents in a private subnet with NAT gateways, public IPs are unnecessary (and potentially confusing to anyone looking at the agents). Perhaps a configuration option as a stack parameter?
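A rough sketch of what such an option could toggle in the launch configuration (the AgentLaunchConfiguration resource name comes from the stack event logs shown elsewhere on this page; wiring it to a new parameter is omitted):

AgentLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    # When launching into private subnets behind a NAT gateway, skip public IPs entirely
    AssociatePublicIpAddress: false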

Scale down based on idle workers, not number of scheduled jobs

Slack comment

My understanding of the current scale down policy is that scaling down on ScheduledJobsCount <= 0 means that even if all workers are 100% busy running jobs, some may still get killed.

I think it makes more sense to tie scaling down to the number of idle agents (perhaps the idle count staying above 0 for X minutes).

ECR Authentication failures should fail builds

E.g

An error occurred (AccessDeniedException) when calling the GetAuthorizationToken operation: User: arn:aws:sts::sdfsdfsd:assumed-role/buildkite-docker-builders-Role/i-sdfsdf is not authorized to perform: ecr:GetAuthorizationToken on resource: *

Ability to customize gc-docker rules

At the moment the hourly cleaning of Docker images is a bummer for caches… especially on our builder machine. It'd be neat to support an S3 file containing the gc-docker rules… or perhaps just paste them in as a stack parameter?

Blocked builds if secret environment variables incorrect

The BUILDKITE_SECRETS_KEY is required to get objects from the secrets bucket. However, if you have it set incorrectly, builds will block when adding an SSH key: the build sits waiting for input on STDIN. Is there some way we can prevent this so builds fail instead?

Publicly publish each stack version

Reverting back to old versions of the stack is a little tricky. In #138 at least there's a URL outside of Buildkite that people can use. But it'd be nicer if there was a public list of releases where you could download the stack JSON file.

One idea would be to have a <major-ver>.<auto-incrementing-number> auto-tagging + release step as part of publishing, so you could go to GitHub releases and find previous releases of the stack. The major version could be checked into the repo, the minor version could just be the Buildkite build number?

Autoscaling policies can livelock at 0 runners

When using the job based autoscaling, you can get into a (common) situation where there aren't any executors (so RunningJobsCount == 0), and there are queued jobs (so ScheduledJobsCount > 0). Then whenever the scale up hook adds a server, in the time it takes for the server to boot, the scale down hook kills it.

AWS ECR login creds are clobbered by pre-command

At the moment there's no way to login to ECR to use the docker-compose build/push support.

If you add an env hook that performs:

$(aws ecr get-login --region us-west-2)

then because of our pre-command hook the creds get removed before any commands are run.

The pre-command hook basically undoes the env hook.

I propose we:

  • Don't rm the docker config files (but if we must keep it, at least move it into the conditional)
  • Add built-in support for AWS_ECR_LOGIN=<region> which performs the login for you

We could also add support for logins in https://github.com/buildkite-plugins/docker-compose-buildkite-plugin itself?

Using a prefix for the secrets bucket doesn't work

It ends up missing the bucket-level permission to list objects. That is, given a prefix like foo/bar, we need IAM permissions for: arn:aws:s3:::foo/bar/* & arn:aws:s3:::foo, but the IAM config misses the second one.
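A sketch of the permissions needed for that example prefix (the bucket and prefix names are the illustrative foo/bar from above):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::foo/bar/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::foo"
    }
  ]
}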

Termination policy

The termination policy is currently Default; however, I think it should be ClosestToNextInstanceHour to get the best $ utilisation when we scale down. The Default policy cares about AZs and launch configurations, which probably aren't a concern for ephemeral CI instances.
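A minimal sketch of that change in the stack template (the AgentAutoScaleGroup resource name is taken from the stack event logs shown elsewhere on this page; other properties are omitted):

AgentAutoScaleGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    # Prefer terminating the instance closest to completing its billing hour
    TerminationPolicies:
      - ClosestToNextInstanceHour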

Instance should be marked as unhealthy if buildkite-agent is stopped

I think you were right on this one @lox — seeing as there's no monit/upstart, if you remotely stop the buildkite-agent process then you're going to get the instance just sitting there, but without any agent running on it.

I imagine what should happen is that the instance is marked as unhealthy and then replaced as soon as that happens.
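One hedged sketch of how a periodic health check on the instance could do that (not something the stack currently ships):

# If the buildkite-agent process is gone, tell the ASG this instance is unhealthy so it gets replaced
if ! pgrep -x buildkite-agent > /dev/null ; then
  INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
  aws autoscaling set-instance-health \
    --instance-id "$INSTANCE_ID" \
    --health-status Unhealthy
fi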

Adding SSH key from S3 hangs in console

I am having an issue similar to #109 where the ssh key downloads correctly from the default secrets s3 bucket, but then is hanging on user input when adding the key. My build output is below. I have also tried setting BUILDKITE_SECRETS_KEY and BUILDKITE_USE_KMS and it has given the same result.

Sourcing CloudFormation environment...
 
Starting an SSH Agent...
Agent pid 2940
 
Downloading secrets...
 
2016-11-09 09:15:29    2.1 KiB private_ssh_key
Downloading ssh key from s3://signpost-buildkite-trial-secrets/private_ssh_key
Enter passphrase for /dev/fd/63:

Thanks for your help.

Matt

RFC: Improving cold cache for large no. of parallel jobs w/ autoscaled instances

If you have a Rails app that takes a long time to do a docker build on a fresh machine, and a pipeline with 50+ parallel spec steps, and an auto-scaling group that goes from 0 to x, having cold Docker caches can be a total bummer. This is because each of the test steps ends up running docker-compose build, which can take a huge amount of time for non-trivial apps… basically creating these massive outlier build times that developers would tear their hair out over.

Now that you can use the beta buildkite agent and https://github.com/buildkite/docker-compose-buildkite-plugin, you're able to add a "build" step at the start of the pipeline and have all subsequent steps pick up this built Docker image, instead of having to build it again on all your machines. This is very similar to how most people would structure their own pipelines anyhow, and it means you end up pushing a Docker image somewhere for each build, which is also nice for later debugging. And it works with the Docker Hub credentials support that's already built into this stack. For example:

- command: build-and-push.sh
- wait
- command: parallel_spec.sh
  name: "Spec %n"
  parallelism: 50

The way I'm doing this at the moment is by creating two separate CloudFormation stacks: buildkite-builders and buildkite-runners. buildkite-builders is a t2.medium with a min size of 1, a max size of 2, and scale-in/scale-out of 0 to disable auto-scaling (you can't have it min 1 max 1, or else stack updates fail because of the stack update policy that requires at least one instance to be in service). buildkite-runners is a c4.medium with spot pricing, min size of 0, max size of 50, and scale-out/in of +-25. This setup means you get an always-on warm cache for building+pushing, and relatively fast spin up of runners for the parallel tests (fast in both auto-scaling and in starting the job on a new instance). This keeps most builds < 5 minutes if there are enough agents, or an extra few minutes if new agents need to be spun up.

I think this sort of thing is a pretty common pattern, and it'd be nice to build this into the stack somehow.

One idea would be to enable you to set a fixed number of "builders" instances which stay on at all times, and which end up with their own queue that you can target using the new docker plugin. It could be that for a given stack you end up with {stack-name}-builders and {stack-name}-runners. There would be no auto-scaling of builders (but they could still be managed via an ASG). If we built this into the stack itself, you'd end up with parameters to configure the two different ASGs, builders and runners.

docker ps hanging

It appears that a first job can execute before other services are fully ready, with the result that the S3 cp that happens in the environment hook just hangs indefinitely (perhaps the instance metadata store isn't yet available).

We should check the init.d run levels:
http://stackoverflow.com/a/16159642

EBS gp2 doesn't support provisioned IOPS

The template tries to provision IOPS, which does nothing, since it specifies general purpose SSD (gp2), which doesn't support provisioned IOPS. Did you intend for a "volume type" argument to go with this?

Better cloudformation stack updates

Right now updating a cloudformation stack is quite buggy and (for me) almost always fails. Even something as simple as updating the spot price for instances is not possible.

As a workaround, we end up creating a new stack and deleting the old one. This is error prone, requires access to API keys, results in failed builds, and is time consuming.

Preferably the process for the cloudformation updates is not as buggy. Unfortunately I have no recommendations on how to improve this at this time.

Updating stack parameters doesn't trigger a rolling update

If you update the buildkite agent queue name, API key, agents per instance or release stream (from stable to beta) you need to manually set the ASG desired size to 0 and force a scale down for it to apply changes, because it's all done in cloud-init. And some people might try just deleting the stack and recreating.

It'd be preferable if you could just update the stack parameters and it's handled for you. This could mean performing a rolling update somehow?

Cloudformation rolls back creation in `us-east-1c`

This mightn't be an issue with this repo, but when I try and spin up a cluster in N. Virginia, it consistently gets rolled back.

AWS::EC2::Subnet
Subnet2
Value (us-east-1c) for parameter availabilityZone is invalid. Subnets can currently only be created in the following availability zones: us-east-1e, us-east-1a, us-east-1d, us-east-1b.

Googling around, it looks like a common problem without any immediate fix. Perhaps some documentation on what to do if people run into this scenario?

Error changing ownership with chown and multiple agents

Because https://github.com/buildkite/elastic-ci-stack-for-aws/blob/418fa67a948c0a86b318ff7dc94d4750510201a1/packer/conf/buildkite-agent/scripts/fix-buildkite-agent-builds-permissions is run from every build's environment hook, concurrent builds on the same instance can clash, resulting in:

chown: changing ownership of '/var/lib/buildkite-agent/builds/buildkite-i-cd2747fc-8/myorg/myapp/buildkite-script-5a7f015d-4583-4bfd-8371-e947b35437c8': No such file or directory

For example, this was a job on buildkite-i-cd2747fc-10 trying to chown a job running on buildkite-i-cd2747fc-8.

This should be scoped to $BUILDKITE_BUILD_CHECKOUT_PATH
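A minimal sketch of that scoping, assuming the script runs in the build's environment where BUILDKITE_BUILD_CHECKOUT_PATH is set:

# Fix ownership only for this build's checkout, instead of every directory under builds/
sudo chown -R buildkite-agent:buildkite-agent "${BUILDKITE_BUILD_CHECKOUT_PATH}"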

Stack creation fails with errors about subnet availability

Sometimes stack creation fails with:

Value (us-east-1a) for parameter AvailabilityZones is invalid. Subnets can currently only be created in the following availability zones...

This is because if no AvailabilityZones is provided, the list of all availability zones is used, but only some of these can have subnets created. Because each account is different (the availability zones are randomly named), a general solution is very difficult.

https://stackoverflow.com/questions/21390444/is-there-a-way-for-cloudformation-to-query-available-zones-for-subnet-creation

new release?

Can you guys release the new version with global env support?

Thanks

Add custom agent parameter customisation

Love the setup. One thing I was hoping to do was add some extra parameters to the CloudFormation template to pass through to the individual buildkite agents, for consistency with our existing agents.

e.g. we use an os=osx and os=linux agent parameter.

We could deduce this from the existing parameters you create for us, but it'd be great to tweak them as well.

I've tweaked the generated CF JSON to get this to work, but I'm not sure where to edit the original templates to get this working. I assume around here in the buildkite-elastic.yml

I'm happy to have a crack if there's an easy way to test I haven't broken anything... on condition that I get to add devops to my LinkedIn ;)

Stepped scaling support

Our main app requires 27 agents, but other pipelines only require 1 or 2. The only way we can make the stack responsive at the start of the day is to make the scaling policy +27/-27 scale out/in, but it's a waste of $ to do that for just a couple of scheduled jobs.

It'd be much better if we could use stepped scaling, and have it scale in/out in smaller increments

Reducing wait time for cold caches with large apps

Currently the best way to handle a large app that takes ages to build is to have two separate instances of the stack, builders and runners, and use https://github.com/buildkite-plugins/docker-compose-buildkite-plugin to run a build step before the test steps.

The builder doesn't autoscale, sits there with a warm Docker cache, and pushes to a registry. The test steps run on the runners queue and pull from Docker Hub without doing any git clone at all.

The pipeline looks like this:

steps:

  - name: ":docker: :package:"
    plugins:
      docker-compose:
        build: app
        image-repository: index.docker.io/myorg/myapp
    agents:
      queue: elastic-builders

  - wait

  - name: ":spec: %n"
    command: ".buildkite/steps/knapsack"
    plugins:
      docker-compose:
        run: app
    agents:
      queue: elastic-runners
    parallelism: 75

A simpler setup would be to use something like http://blog.runnable.com/post/145362675491/distributing-docker-cache-across-hosts to make the cache mostly invisible, and require no separate queue or pipeline config changes.

Using with ECR isn't clear

A few bugs / tough things I've encountered:

  • "Must specify a region" when using AWS_ECR_LOGIN=true. I got around this by force setting the env var AWS_DEFAULT_REGION.
  • ECR Permissions. I continually get:
An error occurred (AccessDeniedException) when calling the GetAuthorizationToken operation: User: arn:aws:sts::053722134931:assumed-role/buildkite-IAMRole-XXX is not authorized to perform: ecr:GetAuthorizationToken on resource: *

I think either more documentation around ECR or having the CloudFormation script add an ECR would make this way easier.

Change bootstrap script to pull from s3 or curl

Hi there,

I notice that the AuthorizedSshUsers parameter can take a URL or an S3 key, and the cfn-init script checks whether it starts with s3:// to determine whether to use curl or aws s3 cp. Can this same logic be applied to the bootstrap script as well? I want to run a script that I would put in S3, using an instance profile to allow access to it...

Many thanks.
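A rough sketch of the requested behaviour (BOOTSTRAP_SCRIPT_URL is a hypothetical variable standing in for the stack's bootstrap script parameter):

# Fetch the bootstrap script via the instance profile when given an S3 URL, otherwise via curl
if [[ "$BOOTSTRAP_SCRIPT_URL" == s3://* ]] ; then
  aws s3 cp "$BOOTSTRAP_SCRIPT_URL" /tmp/bootstrap.sh
else
  curl -sf -o /tmp/bootstrap.sh "$BOOTSTRAP_SCRIPT_URL"
fi
chmod +x /tmp/bootstrap.sh
/tmp/bootstrap.sh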

agent failing on ecr get-login

Especially in the morning, you may see: ERROR: denied: Your Authorization Token has expired. Please run 'aws ecr get-login' to fetch a new one. When this happens, it's an agent that has been running for more than a day.

Here are some excerpts from a build that had this happen. I modified it and removed parts so let me know if this is sufficient (removed any identifying characteristics of what this job is doing).

$ /etc/buildkite-agent/hooks/environment
Setting up the environment	1s
Sourcing CloudFormation environment...
 
Starting an SSH Agent...
Agent pid 3354

Running global pre-command hook	0s
$ /etc/buildkite-agent/hooks/pre-command
Running build script	1s
$ docker-compose -f docker-compose.yml down
docker-compose -f docker-compose.yml run our_command bash -c "command is here"
docker-compose -f docker-compose.yml down
Removing network daily_default
WARNING: Network daily_default not found.
Creating network "daily_default" with the default driver
Creating daily-server_1
Pulling container_name (container details)...
ERROR: denied: Your Authorization Token has expired. Please run 'aws ecr get-login' to fetch a new one.
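A hedged workaround sketch: re-authenticate at the start of every job (for example from a pre-command hook) so long-lived agents never reuse a token past its roughly 12 hour lifetime. The region is illustrative:

# Refresh the ECR token before each job rather than relying on a login done at boot
eval "$(aws ecr get-login --region us-west-2)"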

Smarter autoscaling with multiple agents per instance

We are using the AWS Elastic CI Stack with multiple agents per instance. It looks like the Buildkite scheduler picks (seemingly) random agents from the pool of instances when assigning jobs to builds. This seems to result in the auto-scaler having more capacity than needed, because even if we are only using 20 agents, they are spread out across all 10 instances, with each instance having 2 agents in use.

Preferably the scheduler would pick agents from the instances that are already being used so that the other instances can scale down and save $$.

Document adding an SSH key into secrets bucket in Getting Started

Right now, you must define your SecretsBucket as part of the CloudFormation configuration and only then are you able to override it with an environment variable.

It would be great to be able to set this using only environment variables, so that I could use the default CloudFormation stack and only need to use buildkite to configure this.

Managed Policies no longer work after ECR changes

It appears that #167 has introduced a bug, and now managed policies no longer work.

Stack updates with a comma delimited list now produce this error:
Value of property ManagedPolicyArns must be of type List of String

I have a copy of a prior stack and the diff between them looks like this:

$ diff -uN stack.json buildkite-stack.json
--- stack.json	2016-11-18 14:37:18.000000000 +1100
+++ buildkite-stack.json	2016-11-22 09:06:49.000000000 +1100
@@ -113,17 +113,36 @@
     "IAMRole": {
       "Type": "AWS::IAM::Role",
       "Properties": {
-        "ManagedPolicyArns": {
-          "Fn::If": [
-            "UseSpecifiedIamPolicies",
-            {
-              "Ref": "ManagedPolicyArns"
-            },
-            {
-              "Ref": "AWS::NoValue"
-            }
-          ]
-        },
+        "ManagedPolicyArns": [
+          {
+            "Fn::If": [
+              "UseSpecifiedIamPolicies",
+              {
+                "Ref": "ManagedPolicyArns"
+              },
+              {
+                "Ref": "AWS::NoValue"
+              }
+            ]
+          },
+          {
+            "Fn::If": [
+              "UseECR",
+              {
+                "Fn::FindInMap": [
+                  "ECRManagedPolicy",
+                  {
+                    "Ref": "ECRAccessPolicy"
+                  },
+                  "Policy"
+                ]
+              },
+              {
+                "Ref": "AWS::NoValue"
+              }
+            ]
+          }
+        ],

This looks to be the cause:
https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/templates/buildkite-elastic.yml#L242

ManagedPolicies changes from a hash to an array.

cc:
@lox

lifecycle termination may not be working correctly

It's possible this will be resolved by #57, but... I also have some instances which are no longer running the Buildkite agent, but are still part of the ASG. I found one of these and grabbed this from its logs:

May  6 17:31:51 ip-10-0-1-24 lifecycled: INFO[0368] Received message          instanceid=i-8e00f214 transition=autoscaling:EC2_INSTANCE_TERMINATING
May  6 17:31:51 ip-10-0-1-24 lifecycled: INFO[0368] Executing handler         handler=/usr/bin/buildkite-lifecycle-handler instanceid=i-8e00f214 transition=autoscaling:EC2_INSTANCE_TERMINATING
May  6 17:31:51 ip-10-0-1-24 lifecycled: stopping buildkite-agent gracefully
May  6 17:31:52 ip-10-0-1-24 lifecycled: Stopping buildkite-agent...
May  6 17:31:52 ip-10-0-1-24 lifecycled: Stopped
May  6 17:31:52 ip-10-0-1-24 lifecycled: buildkite-agent stopped!
May  6 17:31:52 ip-10-0-1-24 lifecycled: INFO[0370] Handler finished successfully handler=/usr/bin/buildkite-lifecycle-handler instanceid=i-8e00f214 transition=autoscaling:EC2_INSTANCE_TERMINATING
May  6 17:31:52 ip-10-0-1-24 lifecycled: INFO[0370] Lifecycle action completed successfully instanceid=i-8e00f214 transition=autoscaling:EC2_INSTANCE_TERMINATING

Looks like it thinks it terminated correctly?

cfn-signal doesn't exist

buildkite-elastic.yml includes this in the UserData script:

/usr/local/bin/cfn-signal -e \$? -r 'cfn-init finished' \
  --stack $(AWS::StackName) --resource 'AgentAutoScaleGroup' --region $(AWS::Region)

However, /usr/local/bin/cfn-signal doesn't exist:

Cloud-init v. 0.7.7 running 'modules:config' at Tue, 07 Jul 2015 00:56:00 +0000. Up 50.92 seconds.
#!/bin/bash -xv
/usr/local/bin/cfn-init -s arn:aws:cloudformation:us-east-1:349538293217:stack/ci-20150706-175140/55799120-2442-11e5-b551-50fa1dbb2c64 -r AgentLaunchConfiguration --region us-east-1
+ /usr/local/bin/cfn-init -s arn:aws:cloudformation:us-east-1:349538293217:stack/ci-20150706-175140/55799120-2442-11e5-b551-50fa1dbb2c64 -r AgentLaunchConfiguration --region us-east-1
/var/lib/cloud/instance/scripts/part-001: line 2: /usr/local/bin/cfn-init: No such file or directory
/usr/local/bin/cfn-signal -e $? -r 'cfn-init finished' \
  --stack ci-20150706-175140 --resource 'AgentAutoScaleGroup' --region us-east-1
+ /usr/local/bin/cfn-signal -e 127 -r 'cfn-init finished' --stack ci-20150706-175140 --resource AgentAutoScaleGroup --region us-east-1
/var/lib/cloud/instance/scripts/part-001: line 3: /usr/local/bin/cfn-signal: No such file or directory
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=15.04
DISTRIB_CODENAME=vivid
DISTRIB_DESCRIPTION="Ubuntu 15.04"
$ curl http://169.254.169.254/latest/meta-data/ami-id ; echo
ami-55986a3e

VpcId isn't propagated to the MetricsStack

When AWS Stack goes to create the MetricsStack instance, it uses the default VPC instead of the VpcId defined in the properties (or the VPC it creates for you). If the AWS account doesn't have a default VPC, the CloudFormation script exits with a cryptic error.

Thanks @ryanbigg for helping us find this problem!

Environment variable BUILDKITE_SECRETS_BUCKET doesn't override the default secret bucket name

I tried to set BUILDKITE_SECRETS_BUCKET with a new bucket name "buildkite-pipelines-secrets" but apparently the old bucket name (SecretsBucket property set in cloudformation) is still used by the pipeline to look for a private key. This can be verified in the pipeline log and environment variables.

  1. click the failed pipeline

  2. select the Log tab. The log shows
    "Downloading secrets...
    No SSH keys found in s3://{old bucket name}"

  3. go to the "Environment" tab, find that
    BUILDKITE_SECRETS_BUCKET="buildkite-pipelines-secrets"

Suggestion for smoother autoscaling

The current rules seem to give a hard scale up to max, and a hard scale down to min when idle. This might fit a typical workday in one office, but we have a few developers in different timezones using the elastic stack, so it doesn't really fit: we end up with a lot of unused resources when the hard scale up happens in the middle of the night. I'd prefer a more reactive stack that could scale to fit the demand at the time.

Typically the way I've used autoscaling is to peg average CPU utilisation, adding instances at 60%, and removing instances at 40%. Agents are a little different as we've literally got a queue being processed, which means our stack stats are very spiky. Builds that come in spawn multiple jobs, so the RunningJobsCount and UnfinishedJobs bounces around quite a bit.

So the metric I think will give us better results would be an "agents busy" percentage averaged over 5 minutes or so. Then we could trigger a more gentle scale up at maybe 70%, and a gentle scale down at maybe 20%. Hopefully that would give the elastic stack room to bounce around between 20-70%, while nudging the cluster size in the right direction once utilisation increases or decreases past those points. And we would keep the existing scale up trigger if the ScheduledJobsCount rises above 0, to make sure we scale in the first place and on demand.

Thoughts @toolmantim @lox ?

Scale down doesn't let jobs run gracefully

When a scale down happens I'm finding that a job aborts mid-test-run with ERROR: Couldn't connect to Docker daemon at http+docker://localunixsocket - is it running?, and the last thing in the buildkite-agent CloudWatch logs is "Gracefully shutting down", with no further lines. Yet the job still uploads artifacts etc. after failing and gracefully shuts down via our API; it's only the remaining log lines that are missing in CloudWatch.

My guess is that we're not preventing the awslogs and docker daemons from being shut down in the autoscale terminating event; lifecycled appears only to protect buildkite-agent from being shut down.

Rolling update fails with "Failed to receive 5 resource signal(s)"

When performing a stack update with running instances it's very common for it to fail, making for a long 50 minutes. #126 was meant to help this, but it appears not to help at all, only to make the wait longer.

The stack event log:

21:07:41 UTC+10 UPDATE_ROLLBACK_COMPLETE  AWS::CloudFormation::Stack  elastic-runners 
21:07:41 UTC+10 UPDATE_COMPLETE AWS::CloudFormation::Stack  MetricsStack  
21:07:30 UTC+10 DELETE_COMPLETE AWS::AutoScaling::LaunchConfiguration AgentLaunchConfiguration  
21:07:29 UTC+10 UPDATE_IN_PROGRESS  AWS::CloudFormation::Stack  MetricsStack  
21:07:29 UTC+10 DELETE_IN_PROGRESS  AWS::AutoScaling::LaunchConfiguration AgentLaunchConfiguration  
21:07:26 UTC+10 UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS  AWS::CloudFormation::Stack  elastic-runners 
21:07:25 UTC+10 UPDATE_COMPLETE AWS::CloudFormation::Stack  MetricsStack  
21:07:02 UTC+10 UPDATE_COMPLETE AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup 
21:06:58 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-2e3ce11f
21:06:58 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-293ce118
21:06:58 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-3502df04
21:06:58 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-0b77820a
21:06:58 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-c07580c1
21:04:52 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup New instance(s) added to autoscaling group - Waiting on 5 resource signal(s) with a timeout of PT15M.
21:04:52 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Successfully terminated instance(s) [i-2ae73a1b,i-64d52065,i-278b2026,i-73e43942,i-ddd520dc] (Progress 100%).
21:04:50 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Terminating instance(s) [i-2ae73a1b,i-64d52065,i-278b2026,i-73e43942,i-ddd520dc]; replacing with 5 new instance(s).
21:04:49 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Temporarily setting autoscaling group MinSize and DesiredCapacity to 10.
21:04:49 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Rolling update initiated. Terminating 5 obsolete instance(s) in batches of 5. Waiting on resource signals with a timeout of PT15M when new instances are added to the autoscaling group.
21:04:45 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup 
21:04:43 UTC+10 UPDATE_COMPLETE AWS::AutoScaling::LaunchConfiguration AgentLaunchConfiguration  
21:04:42 UTC+10 UPDATE_IN_PROGRESS  AWS::CloudFormation::Stack  MetricsStack  
21:04:36 UTC+10 UPDATE_ROLLBACK_IN_PROGRESS AWS::CloudFormation::Stack  elastic-runners The following resource(s) failed to update: [AgentAutoScaleGroup].
21:04:34 UTC+10 UPDATE_FAILED AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Failed to receive 5 resource signal(s) within the specified duration
20:24:31 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup New instance(s) added to autoscaling group - Waiting on 5 resource signal(s) with a timeout of PT40M.
20:24:30 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Successfully terminated instance(s) [i-47e93476,i-3fed183e,i-6bb61d6a,i-58ea3769,i-4ffa0f4e] (Progress 50%).
20:24:29 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Terminating instance(s) [i-47e93476,i-3fed183e,i-6bb61d6a,i-58ea3769,i-4ffa0f4e]; replacing with 5 new instance(s).
20:24:28 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-278b2026
20:24:28 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-73e43942
20:24:28 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-64d52065
20:24:28 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-2ae73a1b
20:24:06 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Received SUCCESS signal with UniqueId i-ddd520dc
20:20:14 UTC+10 UPDATE_COMPLETE AWS::CloudFormation::Stack  MetricsStack  
20:18:16 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup New instance(s) added to autoscaling group - Waiting on 5 resource signal(s) with a timeout of PT40M.
20:18:15 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Temporarily setting autoscaling group MinSize and DesiredCapacity to 10.
20:18:15 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup Rolling update initiated. Terminating 10 obsolete instance(s) in batches of 5. Waiting on resource signals with a timeout of PT40M when new instances are added to the autoscaling group.
20:18:11 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::AutoScalingGroup  AgentAutoScaleGroup 
20:18:04 UTC+10 UPDATE_COMPLETE AWS::AutoScaling::LaunchConfiguration AgentLaunchConfiguration  
20:18:03 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::LaunchConfiguration AgentLaunchConfiguration  Resource creation Initiated
20:18:02 UTC+10 UPDATE_IN_PROGRESS  AWS::AutoScaling::LaunchConfiguration AgentLaunchConfiguration  Requested update requires the creation of a new physical resource; hence creating one.
20:17:56 UTC+10 UPDATE_IN_PROGRESS  AWS::CloudFormation::Stack  MetricsStack  
20:17:48 UTC+10 UPDATE_IN_PROGRESS  AWS::CloudFormation::Stack  elastic-runners User Initiated

The ASG log:

21:04:57 UTC+10 Successful Launching a new EC2 instance: i-2e3ce11f
21:04:57 UTC+10 Successful Launching a new EC2 instance: i-293ce118
21:04:57 UTC+10 Successful Launching a new EC2 instance: i-3502df04
21:04:57 UTC+10 Successful Launching a new EC2 instance: i-0b77820a
21:04:57 UTC+10 Successful Launching a new EC2 instance: i-c07580c1
21:04:51 UTC+10 Successful Terminating EC2 instance: i-64d52065
21:04:51 UTC+10 Successful Terminating EC2 instance: i-2ae73a1b
21:04:51 UTC+10 Successful Terminating EC2 instance: i-ddd520dc
21:04:51 UTC+10 Successful Terminating EC2 instance: i-73e43942
21:04:50 UTC+10 Successful Terminating EC2 instance: i-278b2026
20:24:29 UTC+10 Successful Terminating EC2 instance: i-3fed183e
20:24:29 UTC+10 Successful Terminating EC2 instance: i-47e93476
20:24:29 UTC+10 Successful Terminating EC2 instance: i-4ffa0f4e
20:24:29 UTC+10 Successful Terminating EC2 instance: i-58ea3769
20:24:29 UTC+10 Successful Terminating EC2 instance: i-6bb61d6a
20:22:00 UTC+10 Successful Launching a new EC2 instance: i-2ae73a1b
20:22:00 UTC+10 Successful Launching a new EC2 instance: i-278b2026
20:22:00 UTC+10 Successful Launching a new EC2 instance: i-73e43942

Our code that reports if the instance came up successfully is:

/opt/aws/bin/cfn-init -s $(AWS::StackId) -r AgentLaunchConfiguration --region $(AWS::Region)
/opt/aws/bin/cfn-signal -e \$? -r 'cfn-init finished' \
--stack $(AWS::StackName) --resource 'AgentAutoScaleGroup' --region $(AWS::Region)
