rust-lang / simpleinfra Goto Github PK

Rust Infrastructure automation

License: MIT License

Shell 4.74% Rust 3.86% Dockerfile 0.03% Python 10.10% HCL 79.04% JavaScript 2.22%

simpleinfra's Introduction

Simpleinfra

This repository containing the tools and automation written by the Rust infrastructure team to manage our services. Using some of the tools in this repo require privileges only infra team members have.

ansible: Ansible playbooks to deploy our servers
aws-creds: log into AWS with two factor authentication
github-actions: shared actions for GitHub Actions
setup-deploy-keys: automation for GitHub deploy keys
terraform: Terraform configuration to deploy our cloud resources
with-rust-key: execute commands using the Rust release signing key

The contents of this repository are released under the MIT license.

setup-deploy-keys

Using Personal Access Tokens to upload to GitHub pages from CI is not great from a security point of view, as it's not possible to scope those access tokens to just that repository. Deploy keys are properly scoped, but it can be an hassle to generate and configure them.

The setup-deploy-keys tool automates most of that process. You need to setup your GitHub token in the GITHUB_TOKEN environment variable, and then run:

cargo run --bin setup-deploy-keys org-name/repo-name

The tool will generate a key, upload it to GitHub and then print an environment variable GITHUB_DEPLOY_KEY containing an encoded representation of the private key.

To use the key the easiest way is to cd into the directory you want to deploy, download this rust program, compile and run it (with the GITHUB_DEPLOY_KEY variable set).

By default the tool generates ed25519 keys, but some libraries (like git2) don't support them yet. In those cases you can generate RSA keys by passing the --rsa flag:

cargo run --bin setup-deploy-keys org-name/repo-name --rsa

with-rust-key

The with-rust-key.sh script executes a command inside a gpg environment configured to use the Rust release signing key, without actually storing the key on disk. The key is fetched at runtime from the 1password sensitive vault, and you need to have jq and the 1password CLI installed.

For example, to create a git tag for the Rust 2.0.0 release you can use:

./with-rust-key.sh gpg tag -u FA1BE5FE 2.0.0 stable

The script is designed to leave no traces of the key on the host system after it finishes, but a program with your user's privileges can still interact with the key as long as the script is running.

simpleinfra's People

Contributors

Stargazers

Watchers

simpleinfra's Issues

awscreds fails when run in same shell

As far as I can tell, the awscreds script will fail to run if you run it again in the same shell (i.e. AWS_SESSION_TOKEN is still set), as that token is no longer valid after 12 hours, but the AWS CLI will continue to use it.

For now I've manually worked around this via unset AWS_SESSION_TOKEN but we should probably aim to do so in the script? Not sure.

Prune docker cache in docs-rs builder

[Originally reported here]

We want to make sure that we're not growing storage on the builder in an unbounded fashion.

CloudFront invalidations for docs-rs staging

The docs-rs staging environment does not support for CloudFront invalidations which happens through the CLOUDFRONT_DISTRIBUTION_ID_WEB and CLOUDFRONT_DISTRIBUTION_ID_STATIC environment variables. These needed to be added to the web and background builder.

Redirect minor versions to latest patch

We got a request on Zulip to redirect requests for the documentation from the minor version (e.g. https://doc.rust-lang.org/1.65/std/boxed/struct.Box.html) to the latest patch for that version (e.g. https://doc.rust-lang.org/1.65.0/std/boxed/struct.Box.html).

The reason is that the first URL does not exist and will return HTTP 404 Not Found, while the latter works fine.

The CloudFront distribution is configured here and is already using a Lambda router to issue some redirects.

Migrate docs.rs to RDS and ECS

Questions

How can the docs.rs team run one-off commands?
How are database migrations run as part of the deploy process?

`dev-desktop-us-1` does not respond on IPv6

It looks like dev-desktop-us-1.infra.rust-lang.org has dual-stack networking:

$ host dev-desktop-us-1.infra.rust-lang.org
dev-desktop-us-1.infra.rust-lang.org has address 44.204.37.156
dev-desktop-us-1.infra.rust-lang.org has IPv6 address 2600:1f18:619f:e701:fee2:1ae2:4384:ed4c

However, I don't get any response on IPv6. This is annoying because my ssh seems to try IPv6 first by default, whereas ssh -4 connects to this host fine. For now, I've forced it to AddressFamily inet in my ssh_config.

I don't need IPv6 access, but it shouldn't be in DNS if it doesn't work. In comparison, us-2 only advertises IPv4.

It looks like eu-1 and eu-2 are in the same situation.

RCS and rsync

I saw that the restart script for now starts out with an rsync, and it also has --delete! I don't currently store the secrets locally, so maybe we could skip that for now and just persist the secrets on the target machine? (until we sort out #2)

RCS and shells?

I tried using the script this morning but it failed for a few reasons:

Executed as ./restart-rcs.sh it says "./restart-rcs.sh: 1: set: Illegal option -o pipefail", but I fixed this with bash ./restart-rcs.sh
Otherwise the ssh command doesn't actually do anything. I didn't get an error message or anything but I figured that it was something odd with the string syntax. I'm terrible with multiline shell strings!

Did you use bash for this? Another shell? Don't mind installing anything, just curious! (and the script should prolly start w/ #!/bin/$shell)

LLDB fails to start on the cloud dev machines

> lldb --version
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'lldb.embedded_interpreter'
lldb version 14.0.0
> x t tests/debuginfo/basic-types-globals-metadata.rs
Check compiletest suite=debuginfo mode=debuginfo (aarch64-unknown-linux-gnu -> aarch64-unknown-linux-gnu)

running 2 tests
iF
failures:

---- [debuginfo-lldb] tests/debuginfo/basic-types-globals-metadata.rs stdout ----
NOTE: compiletest thinks it is using LLDB version 1400
NOTE: compiletest thinks it is using LLDB without native rust support

error: Error while running LLDB
status: exit status: 1
command: PYTHONPATH="/usr/lib/local/lib/python3.10/dist-packages" PYTHONUNBUFFERED="1" "/usr/bin/python3" "/home/gh-jyn514/rust/src/etc/lldb_batchmode.py" "/home/gh-jyn514/rust/build/aarch64-unknown-linux-gnu/test/debuginfo/basic-types-globals-metadata.lldb/a" "/home/gh-jyn514/rust/build/aarch64-unknown-linux-gnu/test/debuginfo/basic-types-globals-metadata.lldb/basic-types-globals-metadata.debugger.script"
--- stdout -------------------------------
LLDB batch-mode script
----------------------
Debugger commands script is '/home/gh-jyn514/rust/build/aarch64-unknown-linux-gnu/test/debuginfo/basic-types-globals-metadata.lldb/basic-types-globals-metadata.debugger.script'.
Target executable is '/home/gh-jyn514/rust/build/aarch64-unknown-linux-gnu/test/debuginfo/basic-types-globals-metadata.lldb/a'.
Current working directory is '/home/gh-jyn514/rust'
------------------------------------------
--- stderr -------------------------------
Traceback (most recent call last):
  File "/home/gh-jyn514/rust/src/etc/lldb_batchmode.py", line 184, in <module>
    debugger = lldb.SBDebugger.Create()
AttributeError: module 'lldb' has no attribute 'SBDebugger'
------------------------------------------

This is because of llvm/llvm-project#55575 I think, but setting PYTHONPATH=/usr/lib/llvm-14/lib/python3.10/dist-packages doesn't seem to have helped.

Track performance issues with CDNs

Once in a while, we get reports from users that they are experiencing performance issues with our CDNs. We currently neither have good visibility into the frequency of these issues nor do we have good tools to gather data. We will use this issue to collect reports and look for patterns.

Reports

Zulip thread	CDN	Reproducible
search incredibly slow for nightly-rustc docs	CloudFront	No
rustup is slow

doc.rust-lang.org: add caching headers

(moving from rust-lang/rust#82286)

Steps to reproduce:

Load any doc.rust-lang.org page, e.g. https://doc.rust-lang.org/std/str/index.html.
Open Developer Console.
Click the URL bar and hit enter to load the page again.

Expected result:

Most resources are cached and do not hit the network.

Actual result:

Many objects require a network fetch, but get a 304 Not Modified. These should be served with long caching headers.

Current headers:

$ curl -I https://doc.rust-lang.org/SourceCodePro-Regular.woff
HTTP/2 200 
content-type: font/woff
content-length: 55472
last-modified: Thu, 11 Feb 2021 14:27:46 GMT
x-amz-version-id: R.xOjlR7eKEsQwzm9EIrHs9eeCIDeXYq
server: AmazonS3
date: Thu, 18 Feb 2021 13:27:31 GMT
etag: "957fa8c030f8116bea59c13df470e4e8"
x-cache: Hit from cloudfront
via: 1.1 4b84530d7a095b58fb7a1d20b7f0cbe0.cloudfront.net (CloudFront)
x-amz-cf-pop: HIO50-C2
x-amz-cf-id: QpMYsW6BVcTPpDmi8Pg9HiMz_1iSOIpdYv60MMzNqVoV707yqsv_Dw==
age: 63023

Terraform state management

It is a recommended best practice to segment your terraform state so that you can limit the blast radius in the event of adverse modifications. How it is segmented is completely subjective, but I would suggest segmenting it based on service/functionality.

For example:

docs-rs
crates-io
rust-lang
highfive
global - things like Route53 zones, general IAM settings not specific to above services, etc.

This is just a rough first pass but it demonstrates the approach. Changes to docs-rs will not take down or affect any other services when changes to the infrastructure are made. It also speeds up the develop/deploy process as less resources are managed under each statefile and less time is traversing the statefile to compare and report.

Reach out to me at [email protected] and lets chat about overall infrastructure. I have been working with Terraform for over 4 years now and have built out/transformed several environments over the years and am willing to help here if there is interest.

Opening up UDP ports on Cloud Compute Servers

I got a request asking to open up UDP ports 60000-61000 so that mosh can work correctly. Is this something we can do? Any high security issues we should consider?

Also, tangentially, I was asked if we could set kernel.yama.ptrace_scope=0 so that ptrace can attach to existing processes. This is probably not as high of an ask as the UDP ports.

Count Downloads Using CDN Logs

Problem

crates.io counts downloads by crate and version. This is currently done as part of the /download endpoint, which counts the download and then redirects the caller to the Content Delivery Networks (CDNs) for static.crates.io, from where the actual file is downloaded.

sequenceDiagram
	User->>crates.io: Requests crate
	crates.io->>crates.io: Counts crate and version
	crates.io->>User: Redirects user to static.crates.io
	User->>static.crates.io: Requests crate
	static.crates.io->>User: Serves crate file

Due to the volume of requests to the /download endpoint, counting the crate and its version in the application has a significant performance cost. Especially when traffic spikes, the application can struggle to keep up with requests, which in the worst case can cause a service outage.

Goal

Key Objectives

Avoid hitting the web app for every crate download
Continue to count downloads by crate and version

Desired Outcome

In the ideal scenario, we avoid hitting the web app for download requests altogether and go straight to the CDNs. We can achieve this by changing the dl field in the index's config.json to point to the CDN instead of the application. Full compatibility with existing behavior requires to rewrite some URLs, which has already been implemented.

sequenceDiagram
	User->>static.crates.io: Requests crate
	static.crates.io->>User: Serves crate file

The CDNs could attempt to count download, but this is difficult because the CDNs are globally distributed. There is no single point that receives all the traffic, so download counts would need to be processed and merges somewhere else. That system would quickly face the same performance issues that crates.io currently faces.

We can use the request logs from the CDNs to count downloads in an asynchronous way. The CDNs produce a single log line per request. These logs are collected and uploaded periodically to a dedicated S3 bucket as a compressed archive.

Whenever a new archive is uploaded to the bucket, S3 can push an event into a SQS queue. crates.io can monitor the queue and pull incoming events. From the event, it can determine what files to fetch from S3, download and then parse them, and update the download counts in the database.

sequenceDiagram
	static.crates.io ->> S3: Uploads logs
	S3 ->> SQS: Queues event
	crates.io ->> SQS: Pulls event from queue
	crates.io ->> S3: Fetches new log file
	crates.io ->> crates.io: Parses log file
	crates.io ->> crates.io: Updates download counts

Benefits

Logs are processed asynchronously and in batches. This reduces the load on the server, especially during traffic spikes.
Publishing events into SQS is natively supported on AWS and does not require any additional infrastructure (besides an SQS queue).
crates.io is already integrated with S3 to manage crates. Its access can easily be extended to grant access to the SQS queue as well as the logs bucket.
Monitoring the queue and pulling from SQS can be implemented within the existing crates.io codebase. Alternative solutions required additional infrastructure and configuration, which would have fragmented the codebase and made long-term maintenance more difficult.

Notes

Logs from CloudFront and Fastly use a different format.
Compressed archives are typically between 5-20MB in size.

Tasks

Infra-Team

Create a new AWS account for crates.io
Create a new SQS queue
Enable publishing an event from S3 when a new archive is uploaded
Grant crates.io access to the SQS queue and the S3 bucket with the logs

crates.io

(Tracked by the crates.io team)

Create a job that monitors the SQS queue
Fetch and parse new log files
Update the counts in the database
Change dl field to point to the CDN

Resources

2023-11-02 Counting crate downloads in #t-infra on Zulip

Manage rust-lang/rust in team repository

The rust-lang/team repository provides a sustainable way to manage permissions for repositories on GitHub. The rust-lang/rust repository hasn't been added yet, because a large amount of teams and collaborators have access to it. Configuring the repository in the teams repo thus requires careful testing to avoid breaking any integrations or workflows.

Managing permissions by hand is tedious and error-prone, especially when it comes to removing access (which is often forgotten). We therefore want to manage the repository automatically.

This work needs to be done by an administrator who can run the sync-team locally and iterate on the configuration until it applies without breaking changes.

Resources

Discussion on Zulip about rust-lang/rust missing in teams repo

dev-desktop: automatically add git config username&email for all gh users

We can grab all this information from their profile (using the github email address that anonymizes your actual email address)

Spurious network errors for `index.crates.io`

We've received reports in Zulip that GitHub Actions sometimes fail with a network error when trying to download crates. Requests to index.crates.io time out, causing the builds to fail:

Run cargo build
    Updating crates.io index
warning: spurious network error (3 tries remaining): [28] Timeout was reached (Operation too slow. Less than 10 bytes/sec transferred the last 30 seconds)
warning: spurious network error (2 tries remaining): [28] Timeout was reached (Operation too slow. Less than 10 bytes/sec transferred the last 30 seconds)
warning: spurious network error (1 tries remaining): [28] Timeout was reached (Operation too slow. Less than 10 bytes/sec transferred the last 30 seconds)
error: failed to get `syn` as a dependency of package `serde_derive v1.0.183`
    ... which satisfies dependency `serde_derive = "=1.0.183"` of package `serde v1.0.183`
    ... which satisfies dependency `serde = "^1.0"` of package `backtrace v0.3.68 (D:\a\backtrace-rs\backtrace-rs)`
    ... which satisfies path dependency `backtrace` of package `cpp_smoke_test v0.1.0 (D:\a\backtrace-rs\backtrace-rs\crates\cpp_smoke_test)`

Caused by:
  failed to query replaced source registry `crates-io`

Caused by:
  download of 3/s/syn failed

Caused by:
  failed to download from https://index.crates.io/3/s/syn

Caused by:
  [28] Timeout was reached (Operation too slow. Less than 10 bytes/sec transferred the last 30 seconds)
Error: Process completed with exit code 1.

The failures so far always fail on the syn crate, but the given URL https://index.crates.io/3/s/syn works when testing locally.

Examples

References

crates.io spurious network errors topic in t-infra on Zulip
Timeout was reached topic in t-crates-io on Zulip
Timeout investigation topic in t-cargo on Zulip

Link stage0-sysroot in `link_rust.sh`

This is supported since rust-lang/rust#101691.

Relevant code:

simpleinfra/ansible/roles/dev-desktop/files/scripts/link_rust.sh

Line 28 in 75573f9

stages=(stage1 stage2)

Originally posted by @jyn514 in rust-lang/rustc-dev-guide#1615 (review)

Remove Travis and Azure configs

I don't think either the Travis config or Azure config is currently in use by any Rust repos, so we can probably just delete these.

Originally posted by @Mark-Simulacrum in #75 (comment)

Retry renewal of SSL certificates

The renew-ssl-certs.service fails periodically, which triggers the following alert in Grafana:

The systemd unit renew-ssl-certs.service on the <server>:9100 instance failed to execute.

The fix for this alert is to restart the service manually:

sudo systemctl restart renew-ssl-certs.service

We should investigate the reason why the service fails in the first place and retry automatically.

`cancel-outdated-builds` action can be replaced by `concurrency` workflow key(?)

See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#concurrency

Crates.io Index

The https://github.com/rust-lang/crates.io-index should be hosted in something like DynamoDB. DynamoDB is fast for many reads/writes at the same time, auto partitions based on key structure, can be configured with caching (DAX), as well as global tables for fast lookup anywhere in the world. Plus is this is what backes crates.io and that runs in AWS, makes sense to keep it all on the AWS network/backbone and not rely on Github's SLA

Remove dead code from `link_rust.sh`

rust-lang/rust#103286 was never merged and rust-lang/rust#101961 (comment) is blocked.

simpleinfra/ansible/roles/dev-desktop/files/scripts/link_rust.sh

Lines 25 to 26 in 75573f9

 if [ "$bootstrap_version" -gt 2 ]; then 

 stages=(stage1-sysroot stage2-sysroot)

Implement tooling to rollback a ECR service deployment

Currently, the way to rollback a service deployed to ECR is to push a new image and redeploy it, which can be unnecessarily slow.

ECR has support for retagging an image without uploading it again. We should develop a script that prompts which image the user wants to rollback, retags it as latest and forces a new ECS deployment.

Update AWS policies for Billing, Cost Management, and Account consoles

We have been getting notifications in the AWS console as well as reminder emails about an upcoming change to AWS' built-in policies. Relevant for us is that AWS is removing the aws-portal: namespace, which we use to grant access to the billing portal.

AWS provides mappings between the old policies and the more granular new ones. We need to review the differences and then update our Terraform configuration to use the new policies.

Resources

Announcement from AWS

Hello,

This is a reminder to update your policies to avoid changes to your access to AWS Billing, Cost Management and Account consoles. Our records indicate that you are still using retired actions to access these consoles.

If your policies are not updated with new actions by December 11, 2023, your users’ access to the AWS billing, Cost Management, and Account consoles will be affected.

The policies that need to be updated to include the new fine-grained actions are listed in the "Affected Resources" tab of the AWS Health Dashboard in the "Policy | Policy Name | Policy ARN | Type" format.

To help you with the migration, we have published a mapping between old and new actions in our user guide [1]. If you need to update policies across multiple member accounts in your organization, we have built bulk policy migration scripts to help you update all policies quickly and securely from your management account. See the bulk policy migration scripts user guide [2] for more information. You can find a detailed guide on how and which policies you need to update on our blog [3] and definitions of new IAM actions in Cost Management [4] and Billing [5] user guides.

AWS will not be able to grant further exceptions after December 11, 2023, so we strongly recommend that you act promptly to migrate your policies to new actions.

If you have more questions or need help make updates to your policies, please contact AWS Support [6].

[1] https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/migrate-granularaccess-iam-mapping-reference.html
[2] https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/migrate-iam-permissions.html
[3] https://aws.amazon.com/blogs/aws-cloud-financial-management/changes-to-aws-billing-cost-management-and-account-consoles-permissions/
[4] https://docs.aws.amazon.com/cost-management/latest/userguide/migrate-granularaccess-whatis.html#migrate-user-permissions
[5] https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/migrate-granularaccess-whatis.html
[6] https://aws.amazon.com/support

Sincerely,
Amazon Web Services

State resources should not be TF controlled

The S3 bucket and DynamoDB table used to store the backend state file(s) and control locking should not be controlled in Terraform. Its a cyclical dependency and if anything happens to those resources, it could be catastrophic.

Remove this file https://github.com/rust-lang/simpleinfra/blob/master/terraform/terraform-state.tf

Then run manually:

terraform state rm aws_s3_bucket.rust_terraform aws_dynamodb_table.terraform_state_lock

Update Node runtime for AWS Lambda functions

AWS has been notifying us of the upcoming end of support of the Node 16 runtime for Lambda functions. The functions will continue to work, but cannot be updated from August 2024.

We can update our functions to the latest Node 20 runtime.

Checking the AWS Health Dashboard and the code search on GitHub, we currently have four Lambda functions that are deployed using the Node 16 runtime.

Resources

Announcement from AWS

Hello,

We are contacting you as we have identified that your AWS Account currently has one or more AWS Lambda functions using the Node.js 16 runtime.

We are ending support for Node.js 16 in Lambda on June 12, 2024. This follows Node.js 16 End-Of-Life (EOL) reached on September 11, 2023 [1].

As described in the Lambda runtime support policy [2], end of support for language runtimes in Lambda happens in several stages. Starting on June 12, 2024, Lambda will no longer apply security patches and other updates to the Node.js 16 runtime used by Lambda functions, and functions using Node.js 16 will no longer be eligible for technical support. Also, Node.js 16 will no longer be available in the Console, although you can still create and update functions using Node.js 16 via AWS CloudFormation, the AWS CLI, AWS SAM, or other tools. Starting July 15, 2024, you will no longer be able to create new Lambda functions using the Node.js 16 runtime. Starting August 15, 2024, you will no longer be able to update existing functions using the Node.js 16 runtime.

We recommend that you upgrade your existing Node.js 16 functions to Node.js 20 before June 12, 2024. A list of your Node.js 16 functions can be found in the 'Affected resources' tab of the AWS Health Dashboard.

End of support does not impact function execution. Your functions will continue to run. However, they will be running on an unsupported runtime which is no longer maintained or patched by the AWS Lambda team.

This notification is generated for functions using the Node.js 16 runtime for the $LATEST function version. The following command shows how to use the AWS CLI [3] to list all functions in a specific region using Node.js 16, including published function versions. To find all such functions in your account, repeat this command for each Region:

aws lambda list-functions --region us-west-2 --output text --query "Functions[?Runtime=='nodejs16.x'].FunctionArn"

From 180 days before deprecation, you can also use Trusted Advisor to identify all functions using the Node.js 16 runtime, including published function versions [4].

If you have any concerns or require further assistance, please contact AWS Support [5].

[1] https://github.com/nodejs/Release
[2] https://docs.aws.amazon.com/lambda/latest/dg/runtime-support-policy.html
[3] https://aws.amazon.com/cli/
[4] https://docs.aws.amazon.com/awssupport/latest/user/security-checks.html#aws-lambda-functions-deprecated-runtimes
[5] https://aws.amazon.com/support

Sincerely,
Amazon Web Services

Enable users to choose their shell on dev-desktops

A recurring feature request for the dev-desktops has been the ability to customize a user's default shell. The following message in Zulip and the following discussion is a good example: https://rust-lang.zulipchat.com/#narrow/stream/242791-t-infra/topic/Cloud.20compute.20.2F.20Dev.20desktop.20feedback/near/307585274.

Proposed Design

The following is only a proposal and the final implementation may look different.

The simplest solution is probably to add a new configuration to our Ansible role for the dev-desktops that allows users to opt-in to an alternative shell:

# defaults/main.yml
vars_shells:
  gh-jdno: /usr/bin/fish
  gh-octocat: /usr/bin/zsh

The role should then loop over the users in the list, check whether the user actually exists, and if so set their default shell to the selected shell from the configuration file.

This allows users to create a pull request against this repository to change their shell. An admin still has to review the pull request and run Ansible, but this allows us to confirm that the user chose a valid shell.

It's be great to add user-facing documentation on how to change the shell to https://forge.rust-lang.org/infra/docs/dev-desktop.html (in the rust-lang/forge repository).

Alternatives

The discussion linked at the beginning of this issue proposes a solution based on the rust-lang/team repository and a new configuration block in a user's configuration file. While it would remove the need for admins to apply the configuration on the dev-desktops, this aproach would require significant changes to the schema of the team repository and the team_login binary that creates the user accounts on the dev-desktops.

This seems too complicated and time-consuming compared to a simpler solution with Ansible, given that the request to change a shell does not happen very frequently.

Implement persistent state across docker runs

The prod script for rust-central-station right now notably mounts a few volumes for persisting data across runs of the container.

This is useful for information like lets encrypt certificates, logs, secrets, etc. Right now rust-central-station also accesses some config files at runtime instead of at container build time, but we probably shouldn't support that.

Less access for docs-rs background builder to s3

Currently the background builder has complete access to the s3 buckets when it only need some write access (and no read access).

I tried moving to only allowing the following actions: PutObject, PutObjectTagging, PutObjectAcl, CreateBucket, but this failed in a strange way. All artifacts were created and successfully uploaded to the s3. The logs only showed one error which simply stated that the library in question could not be documented (even though it clearly was successfully being done so). The only visible effect of this error was that the crate was not removed from the build queue.

Connection reset by peer on static.rust-lang.org

Page(s) Affected

Most likely the same on all static.rust-lang.org, but this is the page I've been testing with:
https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256

What needs to be fixed?

We've noticed in the last couple of weeks that our CI pipelines started failing with "Connection reset by peer" when trying to install the nightly toolchain, specifically we've seen it when downloading these sha hashes.

This only happens when using IPV6.

I have noticed that static.rust-lang.org sometimes resolve to fastly and sometimes resolve to cloud front. From my testing this seem to only happens when it is resolved to fastly.

So my guess is the combination of fastly+ipv6 causes these errors.

The way I've been able to reproduce this issue, both from our build machines, but also from my computer, is to run a command like this:

while true; do curl -6 https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256 --resolve 'static.rust-lang.org:443:2a04:4e42:200::649'; sleep 1; done

The -6 assures that ipv6 is used, and the --resolve assures that a specific fastly ip is used (but the problem have been noted on all the different fastly ips).

Depending on what network I'm on, the dns doesn't always resolve static.rust-lang.org to fastly, not sure if that is dependent on location or something else, but the above curl will resolve to a fastly ip directly.

❯ while true; do curl -6 https://static.rust-lang.org/dist/2023-04-15/channel-rust-nightly.toml.sha256; sleep 1; done
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
curl: (35) Recv failure: Connection reset by peer
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
curl: (35) Recv failure: Connection reset by peer
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml
31b22ae424824f90e33b888b34d111849e583ed879d21cc113db54b42fd4477a  channel-rust-nightly.toml

The output from a failed curl with -vv added:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2a04:4e42:600::649:443...
* Connected to static.rust-lang.org (2a04:4e42:600::649) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* OpenSSL SSL_connect: Connection reset by peer in connection to static.rust-lang.org:443
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
* TLSv1.0 (OUT), TLS header, Unknown (21):
} [5 bytes data]
* TLSv1.3 (OUT), TLS alert, decode error (562):
} [2 bytes data]
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to static.rust-lang.org:443

PS. not sure if this is the correct place to report this?

Enable Brotli in CloudFront settings for docs.rust-lang.org

(Moving from rust-lang/rust#82286)

We can achieve a significant decrease in file size by enabling Brotli compression, which is an option in CloudFront: https://aws.amazon.com/about-aws/whats-new/2020/09/cloudfront-brotli-compression/

Brotli has been supported in major browsers since 2016: https://en.wikipedia.org/wiki/Brotli

search-index1.50.0.js is currently served as 272,247 gzip bytes, but it could be served as 140,170 brotli bytes.

$ curl https://doc.rust-lang.org/search-index1.50.0.js --header "Accept-Encoding: gzip" -o search-index.js.gz
$ zcat search-index.js.gz | brotli > search-index.js.br
$ ls -l search-index.js.*
-rw-rw-r-- 1 jsha jsha 140170 Feb 18 23:56 search-index.js.br
-rw-rw-r-- 1 jsha jsha 272247 Feb 18 23:56 search-index.js.gz

azure-configs: deploy fails when using Ubuntu 18.04

I updated Ubuntu image to 18.04 from 16.04 in this commit: rust-lang/libc@92ae2be#diff-ff3fa021ec930a60c3eb9896d71cd851R15

But deploy fails in azure-configs/static-websites.yml@rustinfra template (log).

Error says:

/bin/bash --noprofile --norc /home/vsts/work/_temp/0021839b-8dee-420a-83e4-9feb70c90223.sh
touch: cannot touch '/.nojekyll': Permission denied

##[error]Bash exited with code '1'.

I guess the template needs some tweaks when used on 18.04.
It isn't a big deal but it'd be great if we could use it on 18.04.

Please update the format used for build config imports on Travis CI

rust-lang has been one of the early alpha adopters of our build config import feature.

Various repositories are still using an early format for build config sources:

owner/repo/path/to/file.yml

During the product development process this has been superseded by the format:

owner/repo:path/to/file.yml

The format has been publicly announced in Nov 2019 with the feature.

The format you are currently using is not available any more for public use, rust-lang is currently safelisted for this outdated format, so your imports continue to work. However, in future versions of the build config processing component will not be able to support this.

Please update .travis.yml files that are importing config sources using the outdated format to use the new format:

Pretty sure I'm missing other places ...

Also mentioned on:

Btw, if you are importing a single config source only then you can specify it as a string (as opposed to a sequence of strings)

import: rust-lang/simpleinfra:travis-configs/static-websites.yml

The sequence you have will work just fine of course, too:

import: 
- rust-lang/simpleinfra:travis-configs/static-websites.yml

Break up docs-rs builder ansible into many roles

Currently the docs-rs builder ansible playbook is just one large role. The tasks can be broken up into reusable roles that can be used across the various other ansible playbooks (e.g., playground which does similar things to the docs-rs builder).

Secret management

Managing secrets is hard! Some thoughts:

A master list of what secrets are needed is a nice-to-have
Comments on what each secret is and how it's generated is also a nice-to-have
Ability to run in dev with "dummy" secrets is a nice-to-have

As an example rust-central-station has a toml file with a program to parse toml on the command line. This file is just stored on the machine itself and filled in manually. Adding new secrets is sorta painful.

Unsure how to best handle this!

Setup unattended upgrades in Ansible

Add CI to this repo

Also the more we add to this repo it might be good to set up CI for it before too long.

Originally posted by @alexcrichton in #12 (comment)

possibility to install emacs >= 28 inside the servers

emacs is painful for remote development, and for some users that use a personal config and it uses the terminal view, would be very nice to have it installed by default.

If I'm understanding correctly, if we have the emacs inside the server, as a user I should be able to run just fine, right?

git clone --depth 1 https://github.com/doomemacs/doomemacs ~/.emacs.d
~/.emacs.d/bin/doom install

Sign releases with something stronger than SHA1

SHA1 is now rejected by sequoia and rustup

See https://www.reddit.com/r/rust/comments/10qlf1q/nightly_dc1d9d50f_20230131_signature_verification/ and rust-lang/rustup#3185

Steps to reproduce the issue:

cargo install sequoia-sqv
curl -O https://github.com/rust-lang/rustup/blob/master/src/rust-key.pgp.ascii
curl -O https://static.rust-lang.org/dist/channel-rust-nightly.toml
curl -O https://static.rust-lang.org/dist/channel-rust-nightly.toml.asc
sqv --keyring rust-key.pgp.ascii channel-rust-nightly.toml{.asc,}
Signing key on 108F66205EAEB0AAA8DD5E1C85AB96E6FA1BE5FE is not bound:
           No binding signature at time 2023-02-01T00:44:45Z
  because: Policy rejected non-revocation signature (PositiveCertification) requiring second pre-image resistance
  because: SHA1 is not considered secure since 2023-02-01T00:00:00Z

To fix: switch to a stronger digest
Not sure where gpg is called exactly, somewhere in one of the promote-release scripts?

RCS and usernames

Right now the user on rust-central-station is just the bland "ubuntu" whereas I'm normally "acrichton", "acrichto", or "alex" locally. The script for restarting RCS starts out uses $USER to ssh, but perhaps we could rely on ~/.ssh/config to configure that? (that's what I've done in the past)

Slow download speed for uncached artifacts on Fastly

We've had reports of slow download speeds for certain artifacts two or more times in Zulip now. The working theory is that artifacts that are not cached on Fastly are quite slow to download the first time, but fast once they are cached.

This problem affects contributors who use cargo-bisect-rustc the most, since that tool downloads a number of nightlies and tests them. Depending on the nightlies, chances are high that those are not cached. And since multiple nightlies get downloaded sequentially, the issue is felt more severely.

Zulip threads

rustc nightly download speed

	if [ "$bootstrap_version" -gt 2 ]; then
	stages=(stage1-sysroot stage2-sysroot)

rust-lang / simpleinfra Goto Github PK

simpleinfra's Introduction

Simpleinfra

setup-deploy-keys

with-rust-key

simpleinfra's People

Contributors

Stargazers

Watchers

Forkers

simpleinfra's Issues

Questions

Reports

Problem

Goal

Key Objectives

Desired Outcome

Benefits

Notes

Tasks

Infra-Team

crates.io

Resources

Resources

Examples

References

Resources

Resources

Proposed Design

Alternatives

Page(s) Affected

What needs to be fixed?

Zulip threads

Recommend Projects

Recommend Topics

Recommend Org