terraform-aws-quorum-cluster

Warning

This software launches and uses real AWS resources. It is not a demo or test. By using this software, you will incur the costs of any resources it uses in your AWS account.

Please pay attention to the variables you set when using Packer and Terraform, as certain settings can incur large costs.

Work In Progress

This repository is a work in progress. A more complete version of this README and code is coming soon.

Quick Start Guide

Prerequisites

  • You must have AWS credentials at the default location (typically ~/.aws/credentials)
  • You must have the following programs installed on the machine you will be using to launch the network (a quick check follows this list):
    • Python 2.7
    • Hashicorp Packer
    • Hashicorp Terraform
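
A quick way to confirm the prerequisites are installed (assuming each tool is on your PATH):

$ python --version
# Expect a 2.7.x release
$ packer --version
$ terraform --version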

Supported Regions

The following AWS regions are supported for use with this tool. Attempting to use regions not on this list may result in unexpected behavior. Note that this list may change over time as new regions are added to AWS or incompatibilities with existing regions are discovered.

  • us-east-1
  • us-east-2
  • us-west-1
  • us-west-2
  • eu-central-1
  • eu-west-1
  • eu-west-2
  • ap-south-1
  • ap-northeast-1
  • ap-northeast-2
  • ap-southeast-1
  • ap-southeast-2
  • ca-central-1
  • sa-east-1

Generate SSH key for EC2 instances

Generate an RSA key with ssh-keygen. This only needs to be done once. If you change the output file location you must change the key paths in the terraform variables file later.

$ ssh-keygen -t rsa -f ~/.ssh/quorum
# Enter a password if you wish

Add the key to your ssh agent. This must be done again if you restart your computer. If this is not done, it will cause problems provisioning the instances with terraform.

$ ssh-add ~/.ssh/quorum
# Enter your password if there is one
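
You can confirm the key was added with:

$ ssh-add -l
# The fingerprint of ~/.ssh/quorum should appear in the list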

Build AMIs to launch the instances with

Use packer to build the AMIs needed to launch instances

You may skip this step. If you do, your AMI will be the most recent one built by the official Eximchain AWS Account. We try to keep this as recent as possible but currently no guarantees are made.

If you want the script to copy the vault_consul AMI, ensure it is only built into the region the vault cluster will be in.

$ cd packer
$ packer build vault-consul.json
# Wait for build
$ packer build bootnode.json
# Wait for build
$ packer build quorum.json
# Wait for build
$ cd ..

These builds can be run in parallel as well

Then copy the AMI IDs into the terraform variables

$ python copy-packer-artifacts-to-terraform.py

If you would like to back up the previous AMI variables in case something goes wrong with the new one, you can use this invocation instead

$ BACKUP=<File path to back up to>
$ python copy-packer-artifacts-to-terraform.py --tfvars-backup-file $BACKUP

Faster Test Builds

If you want to quickly build an AMI to test changes, you can use an insecure-test-build. This skips over several lengthy software upgrades that require building a new software version from source. The AMIs produced will have additional security vulnerabilities and are not suitable for use in production systems.

To use this feature, simply run the builds from the packer/insecure-test-builds directory as follows:

$ cd packer/insecure-test-builds
$ packer build vault-consul.json
# Wait for build
$ packer build bootnode.json
# Wait for build
$ packer build quorum.json
# Wait for build
$ cd ../..

Then continue by copying the AMI IDs into the terraform variables as usual:

$ python copy-packer-artifacts-to-terraform.py

Generate Certificates

Certificates need to be generated separately, before launching the network. This allows us to delete the state for the cert-tool, which contains the certificate private key, for improved security in a production network.

Change to the cert-tool directory

$ cd terraform/cert-tool

Copy the example.tfvars file

$ cp example.tfvars terraform.tfvars

Then open terraform.tfvars in a text editor and change anything you'd like to change.

Finally, init and apply the configuration

$ terraform init
$ terraform apply
# Respond 'yes' at the prompt

Take note of the output. You will need to input some values into the terraform variables for the next configuration.
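
You can re-print the outputs at any time with terraform output and keep a copy somewhere safe, for example:

$ terraform output
# Optionally save a copy outside this directory
$ terraform output > ../cert-tool-outputs.txt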

If this is an ephemeral test network, you do not need to recreate the certificates every time you replace the network. You can run it once and reuse the certificates each time.

Delete Terraform State

If this is a production network, or otherwise one in which you are concerned about security, you will need to delete the terraform state, since it contains the plaintext private key, even if you enabled KMS encryption.

$ rm terraform.tfstate*

Be aware that an aws_iam_server_certificate and an aws_kms_key are both created by this configuration, and if the state is deleted they will no longer be managed by Terraform. Be sure you have saved the output from the configuration so that it can be imported by other configurations or cleaned up manually.
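
If another configuration later needs to manage these resources, it can adopt them with terraform import. A minimal sketch, assuming that configuration declares resources with the (hypothetical) names shown and that you saved the key ID and certificate name from the cert-tool output:

$ terraform import aws_kms_key.cert_tool <KMS key ID from the cert-tool output>
$ terraform import aws_iam_server_certificate.cert_tool <certificate name from the cert-tool output>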

Launch Network with Terraform

Change to the terraform directory, if you aren't already there from the above step

$ cd terraform

Copy the example.tfvars file

$ cp example.tfvars terraform.tfvars

Check terraform.tfvars and change any values you would like to change (an illustrative excerpt follows this list):

  • Certificate Details: You will need to fill in the values for cert_tool_kms_key_id and cert_tool_server_cert_arn. Replace FIXME with the values from the output of the cert-tool.
  • SSH Location: Our default example file is built for OS X, which puts your home directory and its .ssh folder (aka ~/.ssh) at /Users/$USER/.ssh. If your SSH keyfile is not located within that directory, you will need to update the public_key_path.
  • Network ID: We have a default network value. If there is already a network running with this ID on your AWS account, you need to change the network ID or there will be a conflict.
  • Not Free: The values given in example.tfvars are NOT completely AWS free tier eligible, as they include t2.small and t2.medium instances. We do not recommend using t2.micro instances, as they were unable to compile solidity during testing.
  • Bootnode Elastic IPs: Elastic IP addresses for bootnodes are disabled by default because AWS requires you to manually request more EIPs if you configure a network with more than 5 bootnodes per region. Enabling this feature (use_elastic_bootnode_ips) will maintain one static IP address for each bootnode for the lifetime of the network, keeping you from having to update stored enode addresses when bootnodes fail over.
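
For illustration, the certificate-related portion of terraform.tfvars might look like the following. The values here are placeholders and must be replaced with your own cert-tool outputs and paths:

cert_tool_kms_key_id      = "1234abcd-12ab-34cd-56ef-1234567890ab"
cert_tool_server_cert_arn = "arn:aws:iam::123456789012:server-certificate/vault-cert-example"
public_key_path           = "/Users/yourname/.ssh/quorum.pub"
network_id                = 64813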

If it is your first time using this package, you will need to run terraform init before applying the configuration.

Apply the terraform configuration

$ terraform apply
# Enter "yes" and wait for infrastructure creation

Note the IPs in the output or retain the terminal output. You will need them to finish setting up the cluster.

Launch and configure vault

Pick a vault server IP to ssh into:

$ IP=<vault server IP>
$ ssh ubuntu@$IP

Initialize the vault. Choose the number of key shards and the unseal threshold based on your use case. For a simple test cluster, choose 1 for both. If you are using enterprise vault, you may configure the vault with another unseal mechanism as well.

$ KEY_SHARES=<Number of key shards>
$ KEY_THRESHOLD=<Number of keys needed to unseal the vault>
$ vault operator init -key-shares=$KEY_SHARES -key-threshold=$KEY_THRESHOLD

Unseal the vault and initialize it with permissions for the quorum nodes. Once setup-vault.sh is complete, the quorum nodes will be able to finish their boot-up procedure. Note that this example is for a single key initialization, and if the key is sharded with a threshold greater than one, multiple users will need to run the unseal command with their shards.

$ UNSEAL_KEY=<Unseal key output by vault operator init command>
$ vault operator unseal $UNSEAL_KEY
$ ROOT_TOKEN=<Root token output by vault operator init command>
$ /opt/vault/bin/setup-vault.sh $ROOT_TOKEN

If any of these commands fail, wait a short time and try again. If waiting doesn't fix the issue, you may need to destroy and recreate the infrastructure.

Unseal additional vault servers

You can proceed with initial setup with only one unsealed server, but if all unsealed servers crash, the vault will become inaccessible even though the servers will be replaced. If you have multiple vault servers, you may unseal all of them now; then, if the server handling requests crashes, the others will be on standby to take over.

SSH into each vault server and, for enough unseal keys to reach the threshold, run:

$ UNSEAL_KEY=<Unseal key output by vault operator init command>
$ vault operator unseal $UNSEAL_KEY

Access the Quorum Node

SSH into any quorum node and wait for the exim and constellation processes to start. There is an intentional delay to allow bootnodes to start first.

Check processes have started

One way to check is to inspect the log folder. If exim and constellation have started, we expect to find logs for constellation and quorum, not just init-quorum.

$ ls /opt/quorum/log

Another way is to check the supervisor config folder. If exim and constellation have started, we expect to find the files quorum-supervisor.conf and constellation-supervisor.conf.

$ ls /etc/supervisor/conf.d

Finally, you can check for the running processes themselves. Expect to find a running process other than your grep for each of these.

$ ps -aux | grep constellation-node
$ ps -aux | grep exim

Attach the Exim Console

Once the processes are all running, you can attach to the exim JavaScript console

$ exim attach

You should be able to see your other nodes as peers

> admin.peers

Run Private Transaction Test

The nodes come equipped to run a simple private transaction test (sourced from the official quorum-examples repository) between two nodes.

Deploy the private contract

SSH into the sending node (e.g. node 0) and run the following to deploy the private contract

If you are using Foxpass SSH key management, first authenticate to vault with AWS. You will also need to use sudo to run the test

$ vault auth -method=aws
$ RECIPIENT_PUB_KEY=$(vault read -field=constellation_pub_key quorum/addresses/us-east-1/1)
$ sudo /opt/quorum/bin/private-transaction-test-sender.sh $RECIPIENT_PUB_KEY

Otherwise, you should be authenticated already and sudo is not necessary

# This assumes that the entire network is running in us-east-1
# This assumes there are at least two nodes in us-east-1 and the recipient is the node with index 1
# (the second maker node, or the first validator node if there is only one maker in us-east-1)
# If you would like to choose a different recipient, modify the path beginning with "quorum/addresses"
$ RECIPIENT_PUB_KEY=$(vault read -field=constellation_pub_key quorum/addresses/us-east-1/1)
$ /opt/quorum/bin/private-transaction-test-sender.sh $RECIPIENT_PUB_KEY

The exim console will be attached. Wait for output indicating the contract was mined, which should appear as follows:

> Contract mined! Address: 0x74d977a43deaac2281b6f3d489719f6d2e4aae74
[object Object]

Take note of the address, then in another terminal, SSH into the recipient node and run the following to load the private contract:

$ CONTRACT_ADDR=<Address of the mined private contract>
$ /opt/quorum/bin/private-transaction-test-recipient.sh $CONTRACT_ADDR

The exim console will be attached and the private contract will be loaded. Both the sender and recipient should be able to get the following result from querying the contract:

> simple.get()
42

To demonstrate privacy, you can run the recipient script on a third instance that is not the intended recipient:

$ CONTRACT_ADDR=<Address of the mined private contract>
$ /opt/quorum/bin/private-transaction-test-recipient.sh $CONTRACT_ADDR

The third instance should get the following result instead when querying the contract:

> simple.get()
0

Destroy the Network

If this is a test network and you are finished with it, you will likely want to destroy your network to avoid incurring extra AWS costs:

# From the terraform directory
$ terraform destroy
# Enter "yes" and wait for the network to be destroyed

If it finishes with a single error that looks like the following, ignore it. Rerunning terraform destroy will show that there are no changes to make.

Error: Error applying plan:

1 error(s) occurred:

* aws_s3_bucket.quorum_vault (destroy): 1 error(s) occurred:

* aws_s3_bucket.quorum_vault: Error deleting S3 Bucket: NoSuchBucket: The specified bucket does not exist
	status code: 404, request id: 8641A613A9B146ED, host id: TjS8J2QzS7xFgXdgtjzf6FR1Z2x9uqA5UZLHaMEWKg7I9JDRVtilo6u/XSN9+Qnkx+u5M83p4/w= "quorum-vault"

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Using as a Terraform Module

This repository maintains a terraform module, which you can add to your code by adding a module configuration and setting the source to the URL of the module:

module "quorum_cluster" {
  # Use v0.0.1-alpha
  source = "github.com/Eximchain/terraform-aws-quorum-cluster//terraform/modules/quorum-cluster?ref=v0.0.1-alpha"

  # These values from example.tfvars
  public_key_path           = "~/.ssh/quorum.pub"
  key_name                  = "quorum-cluster"
  aws_region                = "us-east-1"
  network_id                = 64813
  force_destroy_s3_buckets  = true
  quorum_azs                = ["us-east-1a", "us-east-1b", "us-east-1c"]
  vault_cluster_size        = 1
  vault_instance_type       = "t2.small"
  consul_cluster_size       = 1
  consul_instance_type      = "t2.small"
  bootnode_cluster_size     = 1
  bootnode_instance_type    = "t2.small"
  quorum_maker_instance_type = "t2.medium"
  quorum_validator_instance_type = "t2.medium"
  quorum_observer_instance_type = "t2.medium"
  num_maker_nodes           = 1
  num_validator_nodes       = 1
  num_observer_nodes        = 1
  vote_threshold            = 1

  # Currently assuming these are filled in by variables
  quorum_amis   = "${var.quorum_amis}"
  vault_amis    = "${var.vault_amis}"
  bootnode_amis = "${var.bootnode_amis}"
}

Architecture

Terraform Modules

The following modules can be found in the terraform/modules directory. The root quorum configuration in terraform is simply a wrapper for the quorum-cluster module.

These modules contain the core functionality to run the infrastructure for a quorum cluster:

quorum-cluster

The top-level module, suitable for being used directly by another terraform configuration. The quorum-cluster module consists primarily of a single quorum-vault in the primary region, and 14 quorum-cluster-region modules which contain the bulk of the infrastructure (since most of it is regionalized). Additionally, it contains some resources to create a cloudwatch dashboard and alarms in the default region.

quorum-vault

The quorum-vault module provides a durable and secure Hashicorp Vault cluster for use by the whole cluster. This module is used only in the primary region. It maintains a vault cluster, a consul cluster to support it, and an Elastic Load Balancer through which the vault cluster can be accessed. This module is not intended to be used outside a quorum-cluster module.

quorum-cluster-region

The quorum-cluster-region module contains all the infrastructure that exists independently in each region, which is most of the infrastructure. Note that no infrastructure is created in a region whose node counts are all set to 0.

The following major components are included in a quorum-cluster-region:

  • For the whole region
    • A key pair to SSH the instances in the cluster
    • An S3 bucket for constellation payloads
    • An S3 bucket for chain backups
    • An IAM policy allowing access to AWS dependencies
  • For Bootnodes
    • A VPC
    • One subnet per AZ
    • User data scripts
    • One Autoscaling group per bootnode
    • One IAM role per bootnode
    • One security group and rules for it allowing:
      • SSH access which may be limited to specified IPs
      • Access to the constellation port from anywhere
      • Access to the quorum port from anywhere
      • Access to the bootnode discovery port from anywhere
      • Local access to the RPC port
  • For Quorum Nodes
    • A VPC
    • Three subnets per AZ, one for each network role
    • User data scripts
    • One Autoscaling group per node
    • One IAM role per node
    • One security group and rules for it allowing:
      • SSH access which may be limited to specified IPs
      • Access to the constellation port from anywhere
      • Access to the quorum port from specified other roles (see Network Topology for more details)
      • Access to the bootnode discovery port from anywhere
      • Local access to the RPC port
      • Local access to the supervisor RPC port
    • Network ACL rules preventing makers and observers from communicating on their exim ports (see Network Topology for more details)

The following modules contain supporting functionality:

cert-tool

The cert-tool module is used in the Quick Start Guide to generate certificates for vault. This needs to be done outside the main module to avoid having unencrypted private keys persisted in the terraform state.

consul-security-group-rule

This module is originally sourced from a Hashicorp module. It provides security group rules for the consul cluster used in the quorum-vault module.

internal-dns

This module provides a shared private DNS system for the whole cluster by creating a Route53 private hosted zone and associating it with all VPCs in the cluster.

Currently this provides a fixed, well-known DNS name for the vault load balancer so that the certificates can be generated before the load balancer is created.

quorum-vpc-peering

This creates peering connections between the vault VPC and each quorum VPC, as well as between each pair of quorum VPCs. The result is that all VPCs except the bootnode VPCs are connected to each other via peering connections, and can communicate over them.

This has desirable properties. One is that the vault load balancer can be kept internal, reducing the attack surface for the vault server. Another is that exim processes establish connections using their private IPs, which allows us to set cross-region security group rules based on private IP CIDR ranges. This is important in enforcing the Network Topology.

Diagrams

Note that for simplicity, these diagrams depict a three-region network. The primary region is us-east-1 and the network also has nodes in us-west-2 and eu-west-1. Additional regions that may be used in your network have the same architecture as the non-primary regions depicted.

Full network at a high level

[Diagram: Full Cluster Architecture]

This diagram shows the breakdown of the architecture into regions and VPCs, including components that are exclusive to the primary region. The components common to all regions will be expanded upon in another diagram. Note that connections between components are omitted to avoid clutter.

VPC Peering Connections

[Diagram: VPC Peering Connections]

This diagram shows the VPC Peering Connections between VPCs. The Vault VPC and the Quorum Node VPCs are all directly connected to each other. Bootnode VPCs are not connected to any other VPCs. Also pictured is the Internal DNS system, consisting of a single Route53 private hosted zone associated with all VPCs including bootnode VPCs.

Quorum Cluster Region

[Diagram: Quorum Cluster Region]

This diagram shows a more detailed view of a non-primary region. The primary region has additional components, as detailed in the full network diagram. This infrastructure is managed by the quorum-cluster-region module and exists in every region that has nodes. For simplicity, connections between components are omitted and only two Availability Zones and two nodes per AZ are shown.

Network Topology

[Diagram: Network Topology]

The network topology is enforced via the AWS control plane and aims for a fully connected network, ensuring the best possible connectivity between makers and validators. To that end, all non-maker connections to the network go through observer nodes. To allow connections to your network, users should use the observer nodes as bootnodes.

Incoming network connections through other nodes are prevented by security group rules. Since connections can be opened in either direction, Makers and Observers are specifically kept from connecting to each other by ACL rules at the subnet level.

Through clever choice of max_peers and the number of nodes in your network, it is possible to ensure that your initial network is strongly connected.
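
As a rough illustration: a node can only hold a direct connection to every other node if its max_peers is at least one less than the total node count, so a 10-node test network needs max_peers of at least 9 on every node for full connectivity to be possible.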

terraform-aws-quorum-cluster's Issues

Reduce Threatstack alerts to manageable levels

Problem

There is currently a lot of noise in the threatstack alerts. With this level of noise, we are at risk of missing important alerts.

Solution

We need to reduce the false positive level by adding appropriate suppressions and making changes to this repository as necessary

Integrate Backup Procedure

Problem

We have written a backup script that can be used to back up chain data to s3 or restore from a previous backup, but it only runs on a completely manual basis.

Solution

The backup needs to run at least periodically with a cron job or something similar. Ideally, we should also be able to run it in response to a cloudwatch alarm, although that does not block network launch.
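
One minimal approach would be a cron entry on each node. The script path and schedule below are hypothetical and shown only as a sketch:

# m h dom mon dow command
0 */6 * * * /opt/quorum/bin/backup-chain-data.sh >> /opt/quorum/log/backup.log 2>&1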

We also need a way to run the restoration for every instance simultaneously.

Regionalize Backup Procedure

Problem

Currently, the S3 backup procedure implemented in this script backs up the entire network to a single S3 bucket. This results in backups costing more money than they need to, since we pay for cross-region traffic. In particular, restoration will involve downloading the full chain data from every node.

Acceptance Criteria

  • There is an S3 backup bucket in each region that has quorum nodes
  • Nodes back up to and restore from their own region's bucket by default
  • There is a command line option to override the backup bucket name, in case we want to force use of another region's bucket.

Mechanism to remove a single node

We need a mechanism to remove a single node of any index, and keep the rest of the system running. This may involve manually moving things around in vault.

ftp.gnu.org is broken

Today the FTP commands in the script stopped working. Need to fix this, it's breaking packer builds.

Tighten security group parameters on vault

Ideally we have the following

  • Vault can only take requests through the load balancer
  • Consul can only be reached by vault (or SSHed by engineer through VPN)
  • The Vault load balancer only takes requests from our own instances (This is the tricky part, making it work cross region in particular)

Disable Constellation on Mainnet

We don't need to run constellation on our main network nodes since they will never send or receive private transactions. We should disable it to reduce the number of things that could go wrong.

Need Process to Update Software Safely

Problem

Currently, we update software by tearing down the network and launching a new one with new software. This works because the networks are not currently hosting data that they expect to be persistent. We currently have no way to update software reliably without replacing the network.

Solution Requirements

Before we can handle production networks, we need some mechanism to update software that doesn't destroy the data on the chain or bring the network to a halt.

Option to run on dedicated instances

Problem

Running on standard EC2 instances runs a risk of introducing vulnerabilities that exploit the fact that the physical hardware is being shared with other virtual machines. EC2 Dedicated Instances seem to solve this problem by running on physical hardware dedicated to a single customer for an additional cost. However, the current terraform infrastructure does not support that.

Solution

Each role should have an independent option to run on dedicated instances instead of regular instances.

Vault permissions should be more finely tuned

Problem

Currently, everything stored in the vault can be read by any quorum node. While this is intended for 'public' information like addresses, it is not ideal for 'private' information like passwords. Ideally, every node can only read its own private data.

The primary caveat is that all quorum nodes currently use the same IAM role.

Acceptance Criteria

  • A given quorum node can only read or write its own password
  • A given quorum node can only write its own address
  • A quorum node can read any node's address

Enable passwords for constellation keys

Problem

We need password protection on our constellation keys

The previous issue we had might have been due to machines being too small, so we should try again before we start troubleshooting.

AMIs for Terraform Module Usage

Problem

Currently, we assume users have filled in their AMI terraform variables with AMIs built using our packer templates. This is not ideal for use as a module, as it requires the user to run packer builds in their own AWS account first, making the module not as self-contained as it could be.

Proposed Solution

Ideally I think we would maintain a set of publicly available AMIs built with this package and allow the user to override it with their own custom built AMIs if they choose. This would definitely require handling the vault certificate securely, and not just building it into the packer AMI.

init-quorum fails when there are no observers

Problem

The init-quorum.sh script fails when there are no observers, due to the unbound variable OBSERVERS[@].

The command line args that generate the quorum config should gracefully handle the case where a role is empty.
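
One possible guard (a sketch, not the actual init-quorum.sh code) is to give the array expansion a default so an empty role does not trip set -u:

for observer in "${OBSERVERS[@]:-}"; do
  [ -z "$observer" ] && continue   # role is empty; emit no observer arguments
  # ... append observer-related arguments to the generated config here ...
done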

Bootnodes should be replaceable

Problem

When specifying bootnodes, quorum nodes use an enode address which includes the IP address of the bootnode. If a bootnode crashes and is replaced, the IP address will change, and an already active node will have no way of updating its bootnodes.

Potential Solutions

  • Dynamically update the nodes' list of bootnodes
  • Use an Elastic IP for bootnodes (The problem here is that accounts are limited to 5 per region, so that means accepting some limitations on how many networks one account can run at once and how many bootnodes per region they can have)
  • Use a NAT gateway or other networking component to allow nodes to refer to a fixed IP while replacing the underlying instances behind the scenes.

Multi-Region Network

Before we can do any real stress testing, we need to be able to launch a network across multiple regions. Initial plan is to attack this in two phases:

Initial Task Breakdown

Make Cluster Multi-Region Ready

  • Put the vault cluster behind an ELB
  • Vault certificates appropriately handle being behind an ELB
  • Quorum nodes communicate with vault via the ELB
  • Quorum nodes communicate using a public endpoint (probably public hostname)
  • Bootnodes can be addressed with a public IP

Launch a Multi-Region Cluster

  • AZs terraform variable is a map with AZs per Region
  • Instance count terraform variables are maps with counts per Region (see the sketch after this list)
  • Terraform launches instances in all desired Regions
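
For illustration, a per-region count could be expressed as a map in the 0.11-era syntax used elsewhere in this repository; the variable name here is hypothetical:

variable "bootnode_counts" {
  type = "map"

  default = {
    "us-east-1" = 1
    "us-west-2" = 1
    "eu-west-1" = 1
  }
}

# A regional module or resource can then look up its count:
# count = "${lookup(var.bootnode_counts, "us-east-1", 0)}"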

Archival of Private Vault Cluster Functionality

While this is important functionality, it isn't a strict upgrade like some other changes. It will necessarily entail putting the vault cluster behind an IP accessible from the public internet. Some users may want the ability to make a single-region cluster with vault servers that are not behind a publicly accessible endpoint. I'm not sure how exactly to go about this yet but I have a few ideas on possibilities:

  • Find a way to make the ELB completely network private in the single-region case
  • Keep functionality to create such a cluster alongside the multi-region one in the same version of the tool
  • Retain the single-region version that uses consul discovery in a separate branch

I'll probably branch this before committing anything related to it, just in case. I'll probably just make the ELB conditionally internal. It seems like it's just this terraform attribute

Add Elastic IPs to Observers (Optionally)

Since we plan on having users connect through observer nodes, they should optionally have Elastic IPs associated with them, similarly to what we did for bootnodes.

Freshly applied terraform config still proposes changes

Problem

After running terraform apply to launch a cluster, running terraform apply or terraform plan again produces a plan that proposes the following changes:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement
 <= read (data resources)

Terraform will perform the following actions:

 <= module.quorum_cluster.module.quorum_vault.data.template_file.user_data_vault_cluster
      id:                                        <computed>
      rendered:                                  <computed>
      template:                                  "#!/bin/bash\n# This script is meant to be run in the User Data of each EC2 Instance while it's booting. The script uses the\n# run-consul script to configure and start Consul in client mode and then the run-vault script to configure and start\n# Vault in server mode. Note that this script assumes it's running in an AMI built from the Packer template in\n# examples/vault-consul-ami/vault-consul.json.\n\nset -e\n\n# Send the log output from this script to user-data.log, syslog, and the console\n# From: https://alestic.com/2010/12/ec2-user-data-output/\nexec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1\n\nreadonly VAULT_TLS_CERT_DIR=\"/opt/vault/tls\"\nreadonly CA_TLS_CERT_FILE=\"$VAULT_TLS_CERT_DIR/ca.crt.pem\"\nreadonly VAULT_TLS_CERT_FILE=\"$VAULT_TLS_CERT_DIR/vault.crt.pem\"\nreadonly VAULT_TLS_KEY_FILE=\"$VAULT_TLS_CERT_DIR/vault.key.pem\"\n\n# The variables below are filled in via Terraform interpolation\n/opt/vault/bin/generate-setup-vault.sh ${network_id}\n\n# Download vault certs from s3\naws configure set s3.signature_version s3v4\naws s3 cp s3://${vault_cert_bucket}/ca.crt.pem $VAULT_TLS_CERT_DIR\naws s3 cp s3://${vault_cert_bucket}/vault.crt.pem $VAULT_TLS_CERT_DIR\naws s3 cp s3://${vault_cert_bucket}/vault.key.pem $VAULT_TLS_CERT_DIR\n\n# Set ownership and permissions\nsudo chown vault:vault $VAULT_TLS_CERT_DIR/*\nsudo chmod 600 $VAULT_TLS_CERT_DIR/*\nsudo /opt/vault/bin/update-certificate-store --cert-file-path $CA_TLS_CERT_FILE\n\n/opt/consul/bin/run-consul --client --cluster-tag-key \"${consul_cluster_tag_key}\" --cluster-tag-value \"${consul_cluster_tag_value}\"\n/opt/vault/bin/run-vault --s3-bucket \"${s3_bucket_name}\" --s3-bucket-region \"${aws_region}\" --tls-cert-file \"$VAULT_TLS_CERT_FILE\"  --tls-key-file \"$VAULT_TLS_KEY_FILE\"\n"
      vars.%:                                    "6"
      vars.aws_region:                           "us-east-1"
      vars.consul_cluster_tag_key:               "consul-cluster"
      vars.consul_cluster_tag_value:             "quorum-consul"
      vars.network_id:                           "881"
      vars.s3_bucket_name:                       "quorum-vault-network-881-20180217012034246100000007"
      vars.vault_cert_bucket:                    "vault-certs-network-881-20180217012033328000000005"

  ~ module.quorum_cluster.module.quorum_vault.aws_s3_bucket.quorum_vault
      tags.%:                                    "1" => "0"
      tags.Description:                          "Used for secret storage with Vault. DO NOT DELETE this Bucket unless you know what you are doing." => ""

  ~ module.quorum_cluster.module.quorum_vault.module.vault_cluster.aws_autoscaling_group.autoscaling_group
      launch_configuration:                      "quorum-vault-network-881-20180217012808793700000027" => "${aws_launch_configuration.launch_configuration.name}"

-/+ module.quorum_cluster.module.quorum_vault.module.vault_cluster.aws_launch_configuration.launch_configuration (new resource required)
      id:                                        "quorum-vault-network-881-20180217012808793700000027" => <computed> (forces new resource)
      associate_public_ip_address:               "false" => "false"
      ebs_block_device.#:                        "0" => <computed>
      ebs_optimized:                             "false" => "false"
      enable_monitoring:                         "true" => "true"
      iam_instance_profile:                      "quorum-vault-network-8812018021701205695940000000e" => "quorum-vault-network-8812018021701205695940000000e"
      image_id:                                  "ami-cf757fb5" => "ami-cf757fb5"
      instance_type:                             "t2.small" => "t2.small"
      key_name:                                  "quorum-vault-network-881" => "quorum-vault-network-881"
      name:                                      "quorum-vault-network-881-20180217012808793700000027" => <computed>
      name_prefix:                               "quorum-vault-network-881-" => "quorum-vault-network-881-"
      placement_tenancy:                         "default" => "default"
      root_block_device.#:                       "1" => "1"
      root_block_device.0.delete_on_termination: "true" => "true"
      root_block_device.0.iops:                  "0" => <computed>
      root_block_device.0.volume_size:           "50" => "50"
      root_block_device.0.volume_type:           "standard" => "standard"
      security_groups.#:                         "1" => "1"
      security_groups.204346915:                 "sg-1bd08d6c" => "sg-1bd08d6c"
      user_data:                                 "478135dc5e38e83873310d004691d4582911773d" => "643cd0eab8f3d7def9cef600a65268cf39d49ecf" (forces new resource)


Plan: 1 to add, 2 to change, 1 to destroy.

Expected Behavior

Running terraform apply or terraform plan after successfully applying the config should propose no changes. The tag change is ultimately harmless, but the user data change causes a relaunch of the vault cluster, which could interrupt vault availability.

Rewrite cloudwatch-metrics.sh script in another language

Problem

While a shell script for emitting CloudWatch metrics was reasonable when it was limited to polling a single RPC call, it is quickly becoming messy and difficult to develop as I add more metrics and more complex logic.

Solution

Write a metrics agent in another language (probably Python, maybe Go) that simply uses a geth client and a CloudWatch client to do the same thing.
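
A minimal sketch of what such an agent could look like, using only the standard library for the RPC call plus boto3 for CloudWatch. The RPC port, metric name, and namespace are assumptions, not values taken from this repository:

#!/usr/bin/env python
# Hypothetical sketch of a Python metrics agent. Polls the node's JSON-RPC
# endpoint for the current block number and publishes it as a CloudWatch
# metric. Requires boto3; written for the Python 2.7 already required above.
import json
import time
import urllib2

import boto3

RPC_URL = "http://localhost:22000"  # assumed local RPC endpoint; adjust for your node
NAMESPACE = "QuorumCluster"         # hypothetical metric namespace


def rpc_call(method, params=None):
    # Make a JSON-RPC call against the local node and return the result field
    payload = json.dumps({"jsonrpc": "2.0", "method": method, "params": params or [], "id": 1})
    request = urllib2.Request(RPC_URL, payload, {"Content-Type": "application/json"})
    return json.load(urllib2.urlopen(request))["result"]


def main():
    cloudwatch = boto3.client("cloudwatch")
    while True:
        block_number = int(rpc_call("eth_blockNumber"), 16)  # result is a hex string
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[{"MetricName": "BlockNumber", "Value": block_number, "Unit": "Count"}],
        )
        time.sleep(60)


if __name__ == "__main__":
    main()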

Flesh Out Resource Tagging

Problem

The tags for our instances are currently very specific. We should add some tags to aggregate instances across dimensions we care about.

Solution

Add tags to aggregate across:

  • An entire network
  • Each role within a network
  • Update group (once we have a concept of that)

Solidity Compilation Fails

Problem

The software we had been running depends on the Ethereum Ubuntu PPA. An upstream update to the Solidity compiler solc has rendered us unable to send private transactions. We encounter the following error:

Fatal: Failed to start the JavaScript console: /home/ubuntu/private-transaction-test-sender.js: Error: solc: exit status 1
unrecognised option '--add-std'

    at web3.js:3122:20
    at web3.js:6026:15
    at web3.js:4998:36
    at /home/ubuntu/private-transaction-test-sender.js:5:22

Solution

We will need to build working versions of these things into our own PPA and remove our dependency on the external PPA.

Geth passwords should not sit in plaintext in supervisor config

Problem

Currently, the passwords that encrypt the geth keys are passed to geth via the command line, and they are stored in plaintext in the supervisor config. This makes the password effectively useless at protecting the key.

Ideal Solution

Geth could, instead of the plaintext password, be passed a location to retrieve the key from in vault. The password could be retrieved from vault by geth directly, and we could avoid persisting the plaintext password in any permanent form.

Other solutions may be acceptable after determining and reviewing a threat model for attacks involving the password.

Change terraform boolean variables to use "true" & "false"

This is a low-priority issue, just filing for documentation purposes. Today's version of Terraform does inconsistent conversion for boolean variables when they are specified as actual booleans like so:

variable "active" {
  default = false
}

According to the Terraform docs, when variables are specified within the tfvars file, that false gets converted to a 0. When the variable is specified from the command line or as an environment variable, however, it gets converted to "false". They're going to fix this inconsistency in a future update, such that all boolean-like values (e.g. "0", "1", "true", "false") are consistently converted to booleans. In order to maintain compatibility, the best practice right now is to specify boolean variables as strings, like:

variable "active" {
  default = "false"
}

This isn't a launch blocker by any means, but a convention we should get in line with at some point in the future.

Replacing quorum node loses private transaction data

A replacement quorum node that has received a private transaction before being replaced will no longer be able to see the private value.

Steps to Reproduce

  1. Run through the setup, including the private transaction test
  2. Terminate the recipient node
  3. Run terraform apply to replace the instance
  4. Re-run the recipient portion of the private transaction test
  5. In the geth console type simple.get()

Expected Behavior

> simple.get()
42

Observed Behavior

> simple.get()
0

Other Symptoms

The following log is observed in the constellation error log on the replacement instance:

[WARN] Error performing API request: ApiReceive (Receive {rreqKey = "MqYH+VjT2wATh0Ow1FuQ7ADM6kJLGzRN7/ageMwKrFyt5nQDxhjrAjguoP08JPqqDQJqFx6zSZjvIhHsTIqBcg==", rreqTo = "eE+KAGTa7e+GwJzVJ78fEkNpCvZc1LjSjaYOKfyftQY="}); Payload not found in directory /opt/quorum/constellation/data: /opt/quorum/constellation/data/payloads/GKTAP6KY2PNQAE4HIOYNIW4Q5QAMZ2SCJMNTITPP62QHRTAKVROK3ZTUAPDBR2YCHAXKB7J4ET5KUDICNILR5M2JTDXSEEPMJSFIC4Q=: openBinaryFile: does not exist (No such file or directory)

It seems like the payloads need to be persisted across replacements.

Handle vault certificate private key securely

Problem

The handling of vault certificates for TLS is currently insecure in the following ways:

  • The certificates including the private key are currently stored in an encrypted S3 bucket. However, it is only encrypted server-side and the private key is fetched via HTTP, unencrypted, without TLS.
  • We use a terraform-module-based cert-tool as part of our workflow to generate the private certificates. Since the cert-tool is now in the middle of the workflow, the private key ends up in the terraform state, making it a security risk to use a remote backend in particular.

Solution Proposals

(Will be updated with any worthwhile suggestions I think of or that are proposed in comments)

  • Encrypt the certificates with a KMS key before uploading them to S3, and allow quorum nodes to decrypt the certificates after downloading them (sketched after this list).
  • Find a way to use TLS when downloading the certificates
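
A sketch of the first proposal using the AWS CLI; the key ID, bucket name, and file names are placeholders:

# Encrypt before upload
$ aws kms encrypt --key-id <KMS key ID> --plaintext fileb://vault.key.pem \
    --query CiphertextBlob --output text | base64 --decode > vault.key.pem.encrypted
$ aws s3 cp vault.key.pem.encrypted s3://<vault cert bucket>/

# Decrypt on the node after download
$ aws kms decrypt --ciphertext-blob fileb://vault.key.pem.encrypted \
    --query Plaintext --output text | base64 --decode > vault.key.pem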

Validate Blocks by Region

Issue

Since we have a DynamoDB table with the blocks mined per region, we should check that the proportion of blocks created by region is roughly proportional to the number of makers in each region.

Integrate with EFS for storage

Problem

Our nodes need to store the entire blockchain in storage, and the chain grows over time. It would be extremely expensive to provision all the space we will need to run our network for as long as we expect to need to.

Proposed Solution

We will need to start with a moderate amount of space (~250 GB) and have a plan in place to upgrade our storage space when it runs short, without taking down the whole network at once.

Don't create any infrastructure in unused regions

It's looking like the initial implementation of a multi-region network will create a lot of supporting infrastructure even in regions that aren't used. Ideally it all should be created conditionally on any instances being placed in that region.
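
A sketch of what that gating could look like in the 0.11-era syntax used here; the variable and resource names are hypothetical:

resource "aws_vpc" "quorum_region" {
  count      = "${var.regional_node_count > 0 ? 1 : 0}"
  cidr_block = "10.0.0.0/16"
}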

Look into scoping credentials

Heard about this on a Threatstack webinar. We should make sure our credentials are scoped to the tightest restrictions possible. We may be able to restrict access to certain IPs, etc.

Build out basic alarms

We need basic alarms to monitor the network. We already have a lot of metrics but we need alarms on the important ones.

May need to create new metrics for some.
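
As a sketch, a crash alarm could be created with the AWS CLI; the metric name, namespace, and SNS topic are placeholders:

$ aws cloudwatch put-metric-alarm \
    --alarm-name quorum-process-crash \
    --namespace <metric namespace> --metric-name <crash metric name> \
    --statistic Sum --period 300 --evaluation-periods 1 \
    --threshold 0 --comparison-operator GreaterThanThreshold \
    --alarm-actions <SNS topic ARN>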

Set up emergency access for vault via Okta

We should have a way to get master access to the secrets in our vault in a pinch. We'll probably want to set up peering connections between our VPN VPC and our vault. We may need to change the VPN CIDR. We must be able to handle the VPN and vault in different accounts (No auto-accept peering connection)

Mechanism to provide more detail when alarms trigger

Problem

Currently, if we get an alarm notifying us of something like a crashed quorum process, we have no way to tell basic details like what instance it came from

Solution

We need a mechanism that can give us additional information about such things. Our initial goal should be getting the Instance ID, Public DNS, and Region when we get a metric for process crashes, but the design should be extensible to providing other data when metrics are emitted and/or alarms go off. It is acceptable for us to have to retrieve the information as long as we can do so in O(1) time with respect to the number of instances in the network.

I suspect CloudWatch logs are the best tool for the job, but we'll need to evaluate alternatives

Keep EBS volumes during deployment

Currently, we have our nodes re-download the block history from other nodes when deploying them. We could figure out how to detach and reattach the EBS volumes instead, so that this re-download is unnecessary.
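
A rough sketch of the manual flow with the AWS CLI; the volume ID, instance ID, and device name are placeholders:

$ aws ec2 detach-volume --volume-id <volume ID>
# ... replace the instance ...
$ aws ec2 attach-volume --volume-id <volume ID> --instance-id <new instance ID> --device /dev/sdf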

Sporadic process crash metrics emitted

Problem

During quorum stress tests, we noticed that a sporadic CloudWatch metric indicating a process crash is occasionally emitted, generally indicating a crash in both constellation and quorum. Upon inspection of the logs, no evidence of a crash can be found. The most likely scenario is that a bug of some sort is causing the event listener to emit a metric when the process has not crashed.
