infrastructure's Introduction

Infrastructure

Mission Statement

To provide infrastructure for the Adoptium farm that is:

  • Secure - Infrastructure is private by default and access is granted in a time and access control limited manner.
  • Consistent - Infrastructure is consistent in order to produce consistent AdoptOpenJDK binaries.
  • Repeatable - Infrastructure can be reproduced by our infrastructure as code. We embrace the Chaos Monkey.
  • Auditable - What each host/platform is made up of is publicly accessible infrastructure as code.

The end result should be immutable hosts, which can be destroyed and reproduced from Ansible playbooks. See our Contribution Guidelines on how we implement these goals.

Can we Chaos Monkey it

See our current Chaos Monkey Status.

Related Repositories

Important Documentation

Contributing

Please visit our #infrastructure Slack Channel and say hello. Please read our Contribution Guidelines before submitting Pull Requests.

Members

We list administrative members and their organisation affiliation for maximum transparency. Want to add a new member? Please follow our Onboarding Process. If you want access for yourself, raise an issue in this repository for the team to consider it - if you are working on an issue here we will generally be happy to add you to the triage team.

* Indicates access to the secrets repo

Members of this team hold super user access to our machines in order to perform maintenance.

The primary infrastructure team who manage issues and PRs in this repository. People in this team are committers and able to merge pull requests in this repository. In general if you need assistance from a committer, please post a message into the #infrastructure slack channel where one of the committers should be able to help rather than attempting to contact someone directly.

This team is the starting point for new members.

People in this team can take ownership of issues but do not have the privileges to merge pull requests. In general new people in the team will go into this group for a while before being granted additional access.

Infrastructure Providers

The Adoptium project is proud to receive contributions from many companies, either in the form of monetary contributions in exchange for membership or as in-kind contributions of required resources. The Infrastructure team collaborates with the following companies, who contribute various kinds of cloud and physical hardware to the Adoptium project.

Infra Sponsors Page

Host Information

Most information about our machines can be found at Inventory. This file is important not only as a reference for the team: it is also used by AWX, which we often use to deploy Ansible playbooks, so it must be kept up to date.

Maintenance Window Schedule

We will aim to perform routine maintenance on the first Tuesday of each month, generally between 1000-1200 (UTC). This will be announced in the #infrastructure channel on Slack on the day prior to the maintenance. This timing should typically avoid coinciding with release work, although if a release from the previous month is still ongoing then the window can be delayed until the following Tuesday.

Jenkins and its plugins will be updated to the latest LTS every month. Other services such as Bastillion, AWX, and Nagios will be updated as required on a quarterly basis (in the first month of each quarter) during the same window, if required for security reasons. In some cases we may wish to do an out-of-band patch if a sufficiently severe issue is identified.
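
As a rough illustration of the schedule described above, the following sketch computes the routine window for a given month and flags the months in which the quarterly updates would fall. The function names are hypothetical and not part of any Adoptium tooling, and it assumes that "the first month of each quarter" means January, April, July and October.

```python
import datetime

# Hypothetical helpers for illustration only; not part of the Adoptium tooling.


def first_tuesday(year: int, month: int) -> datetime.date:
    """Return the first Tuesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_ahead = (1 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_ahead)


def maintenance_window(year: int, month: int):
    """Return (start, end) of the routine window, 10:00-12:00 UTC."""
    day = first_tuesday(year, month)
    start = datetime.datetime(day.year, day.month, day.day, 10, 0,
                              tzinfo=datetime.timezone.utc)
    return start, start + datetime.timedelta(hours=2)


def is_quarterly_month(month: int) -> bool:
    """Assumption: quarters start in January, April, July and October."""
    return month in (1, 4, 7, 10)
```

If the window has to be delayed because of an ongoing release, adding `datetime.timedelta(days=7)` to both values gives the following Tuesday.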

Standard Action Items

Jenkins

  1. Ensure off-machine backups are working!
  2. Check for plugin updates that will apply to the current version of Jenkins (each plugin should be checked for potential issues in its readme).
  3. Repeat step 2 if necessary until Jenkins does not offer any more plugin updates.
  4. Identify the new LTS level and check the release notes to identify any potential problems. Allow Jenkins to upgrade itself.
  5. Redo steps 2 and 3 so that any plugins that could not be updated due to the older Jenkins level can update themselves.
  6. If necessary, and the remediation cannot be performed within the maintenance window, identify potentially risky plugins that were held back and create an issue to deal with them in the next cycle.
  7. Back up the main war in /usr/share/jenkins to a name with a version suffix in case of corruption to the main war.
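
The final step above could be scripted along the following lines. This is a minimal sketch, not the team's actual procedure: the function name, the `<name>-<version>.war` naming scheme and the refusal to overwrite an existing backup are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def backup_war(war_path: str, version: str) -> Path:
    """Copy the WAR next to itself with a version suffix (hypothetical naming scheme)."""
    war = Path(war_path)
    backup = war.with_name(f"{war.stem}-{version}{war.suffix}")
    if backup.exists():
        # Assumption: an existing backup for this version should never be clobbered.
        raise FileExistsError(f"refusing to overwrite {backup}")
    shutil.copy2(war, backup)  # copy2 also preserves file timestamps
    return backup
```

For example, assuming the WAR is named `jenkins.war`, calling `backup_war("/usr/share/jenkins/jenkins.war", "<version>")` would leave the original in place and write the copy alongside it.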

Backups

These are taken on a daily basis, and one per month is currently kept "forever" on our backup server. Details are now in a separate document.

OS Patch Management

  • Nagios is configured to monitor each system and report on the status of required OS patches, so we can identify any system that is not self-updating.
  • Non-infrastructure systems are configured by Ansible to automatically apply all patches (Sundays at 5am local host time) where possible.
  • Infrastructure systems are configured to automatically apply security patches only (Sundays at 5am local host time). This information is logged on the local host in /var/log/apt-security-updates.
  • We do not currently schedule outages to reboot to pick up new kernels.

infrastructure's People

Contributors

aahlenst, adambrousseau, aixtools, ali-ince, aswinkr77, bblondin, cjkwork, cwesmills, dependabot[bot], fredg02, gdams, geraintwjones, haroon-khel, husainyusufali, jdekonin, julian55455, karianna, luhenry, lumuchris256, mbarbero, neomatrix369, olvap377, pstankie, sej-jackson, steelhead31, sxa, sxa555, vsebe, willsparker, zdtsw

infrastructure's Issues

We do not currently have any sles12 s390x machines tagged with "test"

At the moment we run the systemtests on sles12 as this has a suitable version of the libffi library available (https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/19). At present none of those machines has a tag of test, so I cannot use that tag to run the systemtest jobs. If I leave the tag off I end up with things potentially running on master, which doesn't work well at all as it doesn't have make installed. I could use build, but that would stop the other platforms from using the dedicated test machines, so that's not a sensible solution either. For now I've set the jobs to use !hg, which knocks out three machines including master and is adequate until we get a sles12/s390x box tagged with test.

See also the work item about machine tags: #93

ci.adoptopenjdk.net package upgrade problems

The Jenkins host ci.adoptopenjdk.net had a number of critical OS package updates pending. Upgrading the packages has introduced problems with Jenkins.

Jenkins is up and running, but a number of nodes are currently flagged as offline.

pLinux-LE machine for JCK testing

We need to run the JCK suite on pLinux-LE and its access needs to be locked down so it cannot be shared with other jobs.

Spec-wise something like 2-core, 8Gb RAM and an SSD of around 100Gb would be ideal as that's what we're using for xLinux.

build-marist-s390x-rhel-7.3 unable to resolve host

I am having issues using wget or curl on this machine. This is also preventing me from connecting it to Jenkins.

~ ssh [email protected]
[linux1@adoptopenjdk ~]$ wget https://google.com
--2017-06-26 04:02:53--  https://google.com/
Resolving google.com (google.com)... failed: Connection refused.
wget: unable to resolve host address ‘google.com’
[linux1@adoptopenjdk ~]$ 

CC @bblondin @AdoptOpenJDK/getopenjdk

sigtest: Ubuntu machines need to have JDK 5, JDK 6 and JDK 9

In order to be able to build certain artefacts i.e. code-tools related (e.g. SigTest) we need to have the following JDKs installed:

  • version 5
  • version 6
  • version 9

Versions 7 and 8 are already installed.

Note that OpenJDK versions 5 and 6 are not easily installable via Ansible scripts, so an alternative source will need to be found. This might add to the complications of our artefacts being built using different flavours of JDK (a bit inconsistent).

We could download from Oracle, but with the latest changes we will need a login and password to be able to download old versions of the JDK, which means passing these details into the Ansible script.

zLinux machine for JCK testing

We need to run the JCK suite on zLinux and its access needs to be locked down so it cannot be shared with other jobs.

Spec wise a 2-core, 8Gb, and a fast disk of around 100Gb should be ideal.

text/csv.pm

Text::CSV is installed via ansible playbook:

    - name: Install Text::CSV
      shell: |
        cpanm --with-recommends Text::CSV
      tags: text_csv 

However, it doesn't appear on the machine:

[root@rhel7hcxrt1 ~]# perl -MText::CVS -e 1
Can't locate Text/CVS.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .).

Running the installation command manually says it's already installed and up to date:

[root@rhel7hcxrt1 ~]# cpanm --with-recommends Text::CSV
Text::CSV is up to date. (1.95)

The issue has been seen everywhere we have checked so far:
RHEL 6 PPC64, RHEL 7 x86 PPC64, UB 14/16 x86.

Investigate running Jenkins master as a service

Launching Jenkins currently requires remembering a long command line. To keep things simple it would be preferable to wrap this in a system service or similar, so that it starts at the normal machine boot level and is easier to stop, restart, etc.

Original suggestion by @karianna

Get a Tier 1 x86 sponsor

Currently we are hosting a lot of our x86 hardware with packet.net.

I want to move away from this, as I want to free up our usage limit so that we can provision more ARM machines for testing.

FYI @vielmetti

Vagrant script for ubuntu fails when run in an isolated environment

Standing up an environment using vagrant (for ubuntu 14.04) and running the ubuntu.yml Ansible script halts with the below message:

fatal: [localhost]: FAILED! => {"failed": true, "msg": "An unhandled exception occurred while
running the lookup plugin 'file'. Error was a <class 'ansible.errors.AnsibleError'>, 
original message: could not locate file in lookup: /Vendor_Files/keys/id_rsa.pub"})

This occurs when run on a local machine (reproducible on Linux and MacOSX environments).

Re-running with the -v flag will help; -vv, -vvv, etc. will give more verbose information about the issue.

See #58 (comment) in #58

Once resolved, add any additional information, or any findings made during the course of the investigation, to https://github.com/AdoptOpenJDK/openjdk-infrastructure/blob/master/ansible/README.md.

#helpwanted #bug

pLinux-LE machines for all non-JCK testing

I will piggy-back on the requests for JCK test machines (asking for same requirements as #76),
"Spec-wise something like 2-core, 8Gb RAM and an SSD of around 100Gb ".

One (or eventually two) machines, so that we can enable the following types of tests:

  • openjdk regression tests
  • system/stress tests
  • functional tests

(optionally/eventually some perf micro benchmarks).

Jenkins server - root partition is almost full

The Jenkins server (http://ci.adoptopenjdk.net) root partition is almost full:
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1   188G  142G  37G    80%   /

101GB of this is in /home/jenkins/.jenkins/jobs

Looking at the builds, it does not appear that they are configured to clean themselves up.
This should be configured for all builds.

Example: [screenshot, 2017-07-10 at 4:21 pm]
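
To catch this earlier in future, df output like the above could be checked automatically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the 80% threshold is arbitrary, and mount points containing spaces are not handled.

```python
def parse_df(output: str):
    """Parse `df -h` style text into (filesystem, use_percent, mount_point) tuples."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore wrapped or malformed lines
        percent = fields[4].rstrip("%")
        if not percent.isdigit():
            continue  # some pseudo filesystems report "-" instead of a number
        rows.append((fields[0], int(percent), fields[5]))
    return rows


def nearly_full(output: str, threshold: int = 80):
    """Return mount points whose usage is at or above the threshold (percent)."""
    return [mount for _, used, mount in parse_df(output) if used >= threshold]


# The figures reported in this issue.
sample = """Filesystem Size Used Avail Use% Mounted on
/dev/sda1 188G 142G 37G 80% /"""
print(nearly_full(sample))
```

With the figures from this issue and the assumed threshold, the root partition is reported as nearly full.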

Add additional hosts and services to Nagios

The following machines are not currently known to our Nagios installation, and should be added to ensure their basic health is monitored:

  • api.adoptopenjdk.net
  • staging.adoptopenjdk.net

The following publicly available services should also be monitored so the #infrastructure channel is notified if they go down:

  • HTTP/HTTPS
    • www.adoptopenjdk.net
    • api.adoptopenjdk.net
    • ci.adoptopenjdk.net
    • keybox.adoptopenjdk.net
    • staging.adoptopenjdk.net
    • ansible.adoptopenjdk.net

Include host time synchronization pkgs in Ansible scripts

New machines that are configured for AdoptOpenJDK should have some real time clock synchronization package installed (e.g. NTP, timesyncd, etc) to ensure they do not drift too far and disrupt Jenkins pipeline coordination.

Although many of our jobs are quite long running, when they fail they may fail quickly, and being out of sync by tens of seconds matters.

nagios.adoptopenjdk.net certificate about to expire

Hello,

Your certificate (or certificates) for the names listed below will expire in
19 days (on 26 Jul 17 00:42 +0000). Please make sure to renew
your certificate before then, or visitors to your website will encounter errors.

nagios.adoptopenjdk.net

For any questions or support, please visit https://community.letsencrypt.org/.
Unfortunately, we can't provide support by email.

For details about when we send these emails, please visit
https://letsencrypt.org/docs/expiration-emails/. In particular, note
that this reminder email is still sent if you've obtained a slightly
different certificate by adding or removing names. If you've replaced
this certificate with a newer one that covers more or fewer names than
the list above, you may be able to ignore this message.

Regards,
The Let's Encrypt Team

build-marist-s390x-sles-12 can't resolve itself

I'm getting an issue on build-marist-s390x-sles-12 (148.100.110.56) where it is unable to resolve its own hostname. Can we get an entry for openjdk-sles12 (the output from hostname) added to /etc/hosts on the machine, either with its real IP or just 127.0.0.1, so that it resolves? This is causing some tests to fail as per adoptium/aqa-systemtest#9.
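
The requested change amounts to an idempotent edit of /etc/hosts. A minimal sketch of that logic follows; it works on the file contents as a string so it can be tried without touching a real machine, and the function names are hypothetical rather than taken from any existing playbook.

```python
def has_host_entry(hosts_text: str, hostname: str) -> bool:
    """Return True if any non-comment line already maps an address to hostname."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if hostname in fields[1:]:  # fields[0] is the address
            return True
    return False


def ensure_host_entry(hosts_text: str, hostname: str,
                      address: str = "127.0.0.1") -> str:
    """Return hosts_text with an entry for hostname appended if it is missing."""
    if has_host_entry(hosts_text, hostname):
        return hosts_text
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return f"{hosts_text}{address}\t{hostname}\n"
```

Calling `ensure_host_entry(text, "openjdk-sles12")` twice yields the same result as calling it once, which is the property an Ansible task for this fix would also need.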

Move automated posting to Slack into their own channels

We have a number of automated 'bots' that post to Slack about various topics.

The bots are swamping some channels with automated messages and hiding any real posts. It is also unnecessary to archive most of the bot postings, so we can choose which are archived.

This issue is to create #<blah>-bots channels and switch the bots to posting on there so the humans have a chance.

Biggest offenders are likely:
#infrastructure where Nagios should be posting to #infrastructure-bot (un-archived), and
#website where Localize should be posting to #website-bot (un-archived).

Free up disk space on build-marist-s390x-sles-12 root partition

The /dev/dasdb2 file system on build-marist-s390x-sles-12 (148.100.110.56) is filling up, currently at 96%.

On the Ubuntu sister machine, the system upgrades had multiple versions of the kernel left behind. That may be happening on SLES too.

This task is to clear out any unused packages and kernels to free up the root partition.

Jenkins machine configuration on Windows test machines needs to be updated

The OpenJDK test build on Windows fails with a permission issue:

java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\openjdk_test_x86-64_windows\openjdk-test\OpenJDK_Playlist\openjdk-jdk8u\jdk\test\sun\management\windows\revokeall.exe

According to the last two comments in adoptium/aqa-tests#37 (comment), the Jenkins machine configuration needs to be updated to specify the tool location for git.

The issue is still present, presumably because the configuration has not been updated.

Windows machine for JCK testing

We need to run the JCK suite on Windows and its access needs to be locked down so it cannot be shared with other jobs.

Needs to have a fast disk (so I'd say SSD) and ideally powerful cores (but doesn't need many of them) so something like 2 core/8Gb/100Gb SSD should suffice. Perhaps 16Gb+ if we decide to use a ramdrive for holding the JCK test suite itself.

Windows version TBD - what do Oracle test on?

Bring AIX boxes on-line for build / test

The following new AIX build / test boxes are available to the project. I have added the keybox public key to the list of authorized_keys. Note that there are existing authorized keys that should be retained for the hoster's maintenance use.

power8-aix-openjdk1.osuosl.org - 140.211.9.10
power8-aix-openjdk2.osuosl.org - 140.211.9.12

Each system is 32GB memory, 5 vCPU, 1 CPU unit that can dynamically adapt to 10 CPU, and a minimal AIX 7.1 install. The AIX 7.1.4.4 DVD1 still is "mounted". The OS and related files are installed on filesystems allocated from rootvg, and /home is allocated from homevg. Each volume group is 80GB and most of rootvg is unallocated with considerable room for expansion.

Both systems have been set up with larger queue depth for the hdisks, which improves performance a little. One also can create a ramdisk.

You can customize the systems as you wish.

jenkins: Add ability to let more users view the jenkins job configurations

I've had a few people ask if they can see the job configuration in order to understand what the jobs are doing. While Jenkins doesn't have any integrated ability to allow read-only access (so by default if you can view it you can edit it), there are plugins such as https://wiki.jenkins.io/display/JENKINS/Extended+Read+Permission+Plugin which will change that. Opening this issue for discussion to see if there is any reason not to have this in place: do we have sensitive stuff in the jobs that wouldn't be hidden by this plugin?

Create missing Ansible playbooks for build machines

To ensure that machine images can be reliably recreated for AdoptOpenJDK build/test we need entirely scripted configuration that sets up a VM "from scratch".

A number of the Ansible playbooks exist in the openjdk-build repo, but they are not complete in their coverage.

Proposed steps are:

  • create an initial provisioning script to establish sufficient capability on a new node type to run as an Ansible client (e.g. keybox public key, python, more?)
  • ensure Ansible scripts exist for each CPU/OS type we manage, and are complete.

Goal is that we can discard a VM at any point and recreate it entirely using the public information in our scripts.

Get a second windows build machine

We could do with a Windows 2012 server with Visual Studio 2013 to build the OpenJ9 binaries, as this is the required level for OpenJ9 to build. I will investigate where we could source one from.

Update Nagios to latest version

Our installation of Nagios Core 4.3.1 is outdated and should be upgraded. The latest version of Nagios Core, 4.3.4, was released on 2017-08-24.

Add new s390x Linux machines to build test farm

Marist have generously created two new Ubuntu 16.04 systems for us. One is the replacement for our old RHEL6 image (148.100.110.55) while the other is an extra one that we requested to cope with the additional workload.

Both images have 8 GB memory / 100 GB disk / 4 CPs.

Systems:
LXEOJ905 - 148.100.33.178
LXEOJ906 - 148.100.33.179

I have the login details for these for those that need them.

This task is to configure the machines for build/test as appropriate, add the new nodes to Jenkins, Nagios, etc.

zLinux machine(s) for non-JCK testing

Request for a minimum of 1 (eventually 2, if we do not start sharing machines across build/test functionality) zLinux machine for non-JCK testing (similar request to #77), "Spec wise a 2-core, 8Gb, and a fast disk of around 100Gb".

MacOS machine for JCK testing

It has been agreed that Macstadium will provide us with two further Macs for this purpose. Waiting to find out which OS level to deploy.

Create a Nagios System Configuration Tool

Create a Nagios System Configuration Tool (script) to help set up and configure host.cfg files for new systems that Nagios should monitor. The tool should:

  • ask questions, then generate the host.cfg file
  • test and enable monitoring
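
One possible shape for the generation step is sketched below. It is an illustration under stated assumptions, not the tool itself: the function name is hypothetical, answers are passed as arguments instead of being prompted for, and `linux-server` is assumed to be an available host template (it is the name used in the Nagios Core sample configuration, which may differ from what is installed here).

```python
def render_host_cfg(host_name: str, address: str, alias: str = "",
                    template: str = "linux-server") -> str:
    """Render a minimal Nagios host object definition as text."""
    directives = [
        ("use", template),           # assumption: this host template exists
        ("host_name", host_name),
        ("alias", alias or host_name),
        ("address", address),
    ]
    # Pad directive names so the values line up in a column.
    body = "\n".join(f"    {key:<12}{value}" for key, value in directives)
    return f"define host {{\n{body}\n}}\n"


# Hypothetical host name and a documentation-range IP address.
print(render_host_cfg("build-example-1", "192.0.2.10"))
```

For the "test" step, Nagios can verify a configuration before it is loaded by running `nagios -v` against the main configuration file, so a generated host.cfg could be checked that way before monitoring is enabled.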

Request for access to Packet ARM systems

I'm working on getting the OpenJDK/OpenJ9 builds working on ARM. Would it be possible to get access to the ARM build systems for some basic toe-in-the-water evaluations of my initial builds?
