
houndigrade's Introduction

cloudigrade


What is cloudigrade?

cloudigrade was an open-source suite of tools for tracking RHEL use in public cloud platforms, but in early 2023, its features were stripped down to support only the minimum needs of other internal Red Hat services. The 1.0.0 tag marks the last fully functional version of cloudigrade.

What did cloudigrade do?

cloudigrade actively checked a user's account in a particular cloud for running instances, tracked when instances were powered on, determined if RHEL was installed on them, and provided the ability to generate reports to see how many cloud compute resources had been used in a given time period.

See the cloudigrade wiki for more details.

What are "Doppler" and "Cloud Meter"?

Doppler was an early code name for cloudigrade. Cloud Meter is the productized Red Hat name for its running cloudigrade service. cloudigrade == Doppler == Cloud Meter for all intents and purposes. 😉

houndigrade's People

Contributors

abaiken, abellotti, elyezer, infinitewarp, kholdaway, pyup-bot, renovate[bot], werwty


houndigrade's Issues

No such file or directory: 'mount'

In GitLab by @infinitewarp on Aug 10, 2020, 10:23

Summary

Sometimes houndigrade raises FileNotFoundError unexpectedly because the mount command appears to have gone missing. It is unclear how this is possible since mount should always be part of the base image.

Steps to Reproduce

  1. Generate activity such that houndigrade runs to inspect an AMI.
  2. Unknown... 😕

Expected Result

  • Inspection processes normally.
  • A message for each image is posted to the configured SQS "inspection results" queue.
  • houndigrade ECS scales down to 0.

Actual Result

  • Inspection blows up unexpectedly.
  • No message is posted to SQS.
  • houndigrade ECS stays scaled up to 1 forever (or until the reaper forces it down).

Additional context

  • QE people will at a minimum run regression tests around the inspection process to verify that we did not regress any previously-working functionality.

  • If we find reliable steps to reproduce the problem, devs discuss with QEs during development to determine if it's possible and reasonable for QE to build a new test for that problem.

  • Is there a way to put the inspection startup in a loop and capture the logs to see if it's possible to recreate and observe an actual error run like this?

    • @infinitewarp thinks this would be difficult with AWS due to how we configure the ECS task to delete volumes upon completion, but it's worth taking a few minutes to investigate and try. Don't get stuck on this if it's not straightforward.
  • What should we do to better handle this if we can't find the cause?

  • Should we catch this exception, report something interesting to stdout so it's logged, and report the images back to SQS with "error" state so houndigrade can cleanly scale down?

  • Should we have a new/different message format for the inspection results queue to indicate that houndigrade failed to run but the images are neither error nor inspected?

    • Yes, let's do this. At the very least, it gets us more information that we desperately need.
  • Should we simply check that mount exists and is executable before we call it?

    • This should not be necessary since the image should be 100% static, but checking before would allow us to report an error and cleanly exit. (A rough sketch of this check follows the logs at the end of this issue.)
    • If we do this, we need to put some kind of message on the results SQS queue to tell cloudigrade that houndigrade encountered an unrecoverable error and could not inspect the listed image.
  • Should we think of some new way to back off and retry running the houndigrade task?

    • No we don't want to back off and retry.
  • CloudWatch logs in our production AWS account indicate a houndigrade run at 2020-08-07T18:49:54.269Z for image ami-0ff6d47107892dd85 failed with this error, but the next run at 2020-08-07T19:49:25.510Z for images ami-00ff566775b0b66e1 and ami-0ff6d47107892dd85 did not fail.

  • Note that the second run included the same image as the first run.

  • Log of first run that raised this exception:

--------------------------------------------------------------------------------------------------------
|   timestamp   |                                       message                                        |
|---------------|--------------------------------------------------------------------------------------|
| 1596826194262 | Provided cloud: aws                                                                  |
| 1596826194262 | Provided drive(s) to inspect: (('ami-0ff6d47107892dd85', '/dev/xvdba'),)             |
| 1596826194262 | Checking drive /dev/xvdba                                                            |
| 1596826194263 | Checking partition /dev/xvdba1 for image ami-0ff6d47107892dd85                       |
| 1596826194269 | Traceback (most recent call last):                                                   |
| 1596826194269 |   File "cli.py", line 626, in <module>                                               |
| 1596826194269 |     main()                                                                           |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__ |
| 1596826194269 |     return self.main(*args, **kwargs)                                                |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main     |
| 1596826194269 |     rv = self.invoke(ctx)                                                            |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke  |
| 1596826194269 |     return ctx.invoke(self.callback, **ctx.params)                                   |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke   |
| 1596826194269 |     return callback(*args, **kwargs)                                                 |
| 1596826194269 |   File "cli.py", line 72, in main                                                    |
| 1596826194269 |     mount_and_inspect(drive, image_id, results, debug)                               |
| 1596826194269 |   File "cli.py", line 133, in mount_and_inspect                                      |
| 1596826194269 |     with mount(partition, INSPECT_PATH):                                             |
| 1596826194269 |   File "/usr/lib64/python3.8/contextlib.py", line 113, in __enter__                  |
| 1596826194269 |     return next(self.gen)                                                            |
| 1596826194269 |   File "cli.py", line 216, in mount                                                  |
| 1596826194269 |     mount_result = subprocess.run(                                                   |
| 1596826194269 |   File "/usr/lib64/python3.8/subprocess.py", line 489, in run                        |
| 1596826194269 |     with Popen(*popenargs, **kwargs) as process:                                     |
| 1596826194269 |   File "/usr/lib64/python3.8/subprocess.py", line 854, in __init__                   |
| 1596826194269 |     self._execute_child(args, executable, preexec_fn, close_fds,                     |
| 1596826194269 |   File "/usr/lib64/python3.8/subprocess.py", line 1702, in _execute_child            |
| 1596826194269 |     raise child_exception_type(errno_num, err_msg, err_filename)                     |
| 1596826194269 | FileNotFoundError: [Errno 2] No such file or directory: 'mount'                      |
--------------------------------------------------------------------------------------------------------

  • Log of second run that did not fail:

------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                     message                                                      |
|---------------|------------------------------------------------------------------------------------------------------------------|
| 1596829764189 | Provided cloud: aws                                                                                              |
| 1596829764189 | Provided drive(s) to inspect: (('ami-00ff566775b0b66e1', '/dev/xvdba'), ('ami-0ff6d47107892dd85', '/dev/xvdbb')) |
| 1596829764189 | Checking drive /dev/xvdba                                                                                        |
| 1596829764190 | Checking partition /dev/xvdba1                                                                                   |
| 1596829764420 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764421 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764421 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764422 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764423 | RHEL not found via product certificate on: /dev/xvdba1                                                           |
| 1596829764435 | RHEL not found via enabled repos on: /dev/xvdba1                                                                 |
| 1596829764960 | RHEL not found via signed packages on: /dev/xvdba1                                                               |
| 1596829764960 | No syspurpose.json file found on: /dev/xvdba1                                                                    |
| 1596829764961 | RHEL not found on: ami-00ff566775b0b66e1                                                                         |
| 1596829764994 | Checking drive /dev/xvdbb                                                                                        |
| 1596829764995 | Checking partition /dev/xvdbb1                                                                                   |
| 1596829765048 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765049 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765049 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765050 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765051 | RHEL found via product certificate on: /dev/xvdbb1                                                               |
| 1596829765060 | RHEL not found via enabled repos on: /dev/xvdbb1                                                                 |
| 1596829765508 | RHEL not found via signed packages on: /dev/xvdbb1                                                               |
| 1596829765510 | RHEL (version 8.0) found on: ami-0ff6d47107892dd85                                                               |
------------------------------------------------------------------------------------------------------------------------------------
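
The pre-check idea discussed above could look roughly like the following. This is only a sketch, not houndigrade's actual cli.py code, and the error handling (raising so the caller can post an "error" result to SQS) is an assumption rather than existing behavior.

import shutil
import subprocess


def run_mount(partition, mount_path):
    """Mount a partition, but fail cleanly if the mount binary is missing."""
    # Hypothetical pre-check: confirm `mount` is on PATH before calling it,
    # so we can post an "error" result to SQS and scale down cleanly instead
    # of dying with an unhandled FileNotFoundError.
    if shutil.which("mount") is None:
        raise RuntimeError("mount binary not found; report image as 'error'")
    return subprocess.run(
        ["mount", "-t", "auto", partition, mount_path],
        capture_output=True,
        check=False,
    )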

Identify the linux distro in an attached volume

@infinitewarp commented on Fri Mar 02 2018

As a Doppler admin, I want to inspect a volume I've created from an AMI and tell Doppler what linux distro is on it so Doppler can correctly separate usage by distro when producing reports.

Acceptance Criteria

  • Verify that a new message with information about the distro is queued (going to the handler) when I run an inspector instance with an attached volume that came from an AMI containing RHEL.
    • Verify the message includes the release name (e.g. Red Hat Enterprise Linux Server release 7.3 (Maipo))
    • Verify the message includes a true boolean indicator that the distro is RHEL.
    • Verify the message includes sufficient information for the handler to identify the AMI ID from which the volume originated.
  • Verify that a new message with information about the distro is queued (going to the handler) when I run an inspector instance with an attached volume that came from an AMI not containing RHEL.
    • Verify the message includes the release name
    • Verify the message includes a false boolean indicator that the distro is not RHEL.
    • Verify the message includes sufficient information for the handler to identify the AMI ID from which the volume originated.
  • Verify that no message is written and the instance dies cleanly when I run an inspector instance with an attached volume with a nonstandard linux filesystem (e.g. LVM)

Tech assumptions

  • The output of this story includes a container image that we can run in the various clouds (AWS, Azure, GCP), and it includes a script that runs at start.
  • Just use mount to mount the attached volume in the running instance. Assume mount can figure out the filesystem for now; if it can't, the instance should die gracefully and need not report failure. (A rough sketch of this flow follows this list.)
  • Queue URI and credentials will be provided to the hound via environment variables.
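
A rough sketch of the flow these assumptions imply, assuming a hypothetical QUEUE_URL environment variable and INSPECT_PATH mount point rather than the real cli.py internals:

import json
import os
import subprocess

import boto3

INSPECT_PATH = "/mnt/inspect"        # hypothetical mount point
QUEUE_URL = os.environ["QUEUE_URL"]  # queue URI provided via environment


def inspect_and_report(device, image_id):
    """Mount a device, read its release file, and report the distro."""
    subprocess.run(["mount", device, INSPECT_PATH], check=True)
    try:
        with open(os.path.join(INSPECT_PATH, "etc/redhat-release")) as f:
            release = f.read().strip()
    except FileNotFoundError:
        release = None
    finally:
        subprocess.run(["umount", INSPECT_PATH], check=True)

    message = {
        "image_id": image_id,  # lets the handler map back to the source AMI
        "release": release,
        "is_rhel": bool(release and "Red Hat Enterprise Linux" in release),
    }
    boto3.client("sqs").send_message(
        QueueUrl=QUEUE_URL, MessageBody=json.dumps(message)
    )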

Houndigrade is not writing results to the sqs queue.

Summary

After launching an ECS task to inspect images, I expect houndigrade to write the results to the SQS queue. Houndigrade happily exits, but no message is written.

Steps to Reproduce

  1. Add a new account and have cloudigrade do all prep and kick off an inspection job
  2. Watch houndigrade task finish

Expected Result

  • There is a results payload in the namespace-inspection_results queue that can be picked up by the celery task to process it

Actual Result

  • There is no payload, and no queue matching that name exists, even though it should be auto-created.

Additional context

Inspection process hanging in `preparing` or `inspecting`

In GitLab by @mpierce on Jul 30, 2020, 14:53

Summary

When I

  • run one instance each of two different x86_64 and/or arm64 images that should be inspected and then recognized as RHEL,
  • then create a source and thereby a Cloud Meter account,

Houndigrade's inspection process hangs and doesn't finish, leaving instances in either status:preparing or status:inspecting.

Steps to Reproduce

  1. Launch an instance each of the below two images
  2. Create a Source, Cloud Meter account
  3. Wait for inspection process (should not take more than 30 minutes (really 15, but just to be sure)).
  4. Check facts returned by Cloud Meter about each instance
    Image(
        'ami-0d202a94482e3d42c',
        'x86_64',
        't2.micro',
        'Premium',
        'Development/Test',
        'RHEL-x86_64-Workstation-premium',
    ),
    Image(
        'ami-0169e18bfd191f4e6',
        'x86_64',
        't2.micro',
        'Self_support',
        'Disaster Recovery',
        'RHEL-x86_64-Server-self-support',
    ),
    Image(
        'ami-06d8921b0a9cd7c9c',
        'arm64',
        'a1.medium',
        '',
        'Red Hat Enterprise Linux Server',
        'RHEL x86_64 Server no-sla for inspections', # x86_64 is a lie here, but I can't change the name
    ),
    Image(
        'ami-054fcc5334b1abd1c',
        'arm64',
        'a1.medium',
        'self-support',
        'Red Hat Enterprise Linux Server',
        'RHEL x86_64 Server Self-support for inspections', # x86_64 is a lie here, but I can't change the name
    ),

These images are in the DEV07 account.
Much greater detail about each of the instances is available in the attached files.

Expected Result

  • Within a reasonable amount of time (< 20 min?), images are:
    • status:inspected
    • identified as RHEL
    • Cloud Meter reports accurate syspurpose.json details

Actual Result

  • Even after 60 minutes, instances have not made it through the inspection process. Process seems to have frozen.

Additional context

All of these images are originally copies of Red Hat's Cloud Accounts (Access2) images. Based on our present understanding of our code and the way AWS behaves, that should be OKAY and largely irrelevant since the new copies both A) have a different owner account ID and B) have new names that DO NOT match the Cloud Access naming conventions. However, it's a possible connection to weird inspection behaviors!!

889259.txt

889260.txt

889344.txt

889345.txt

Inspection error resulting in ecs cluster not scaling down

In GitLab by @werwty on Mar 10, 2020, 16:18

Brad: Regarding the reaper, I just peeked at the CloudWatch Log Group for qa/test, and the last logs were at "2020-03-10 09:38 UTC-4" and appear to have normal activity in them. I wonder if something prevented the last run from even starting up successfully before it could log anything.

Brad: digging back up through the cloudwatch logs in test from earlier today, I found this curiosity…
Provided drive(s) to inspect: (('ami-065d0be9883aa064a', '/dev/xvdba'), ('ami-065d0be9883aa064a', '/dev/xvdbb'), ('ami-065d0be9883aa064a', '/dev/xvdbc'), ('ami-065d0be9883aa064a', '/dev/xvdbd'), ('ami-065d0be9883aa064a', '/dev/xvdbe'), ('ami-065d0be9883aa064a', '/dev/xvdbf'), ('ami-065d0be9883aa064a', '/dev/xvdbg'), ('ami-065d0be9883aa064a', '/dev/xvdbh'), ('ami-065d0be9883aa064a', '/dev/xvdbi'), ('ami-065d0be9883aa064a', '/dev/xvdbj'), ('ami-065d0be9883aa064a', '/dev/xvdbk'), ('ami-065d0be9883aa064a', '/dev/xvdbl'), ('ami-065d0be9883aa064a', '/dev/xvdbm'), ('ami-065d0be9883aa064a', '/dev/xvdbn'), ('ami-065d0be9883aa064a', '/dev/xvdbo'), ('ami-065d0be9883aa064a', '/dev/xvdbp'), ('ami-065d0be9883aa064a', '/dev/xvdbq'), ('ami-065d0be9883aa064a', '/dev/xvdbr'), ('ami-065d0be9883aa064a', '/dev/xvdbs'), ('ami-065d0be9883aa064a', '/dev/xvdbt'), ('ami-065d0be9883aa064a', '/dev/xvdbu'), ('ami-065d0be9883aa064a', '/dev/xvdbv'), ('ami-065d0be9883aa064a', '/dev/xvdbw'), ('ami-065d0be9883aa064a', '/dev/xvdbx'), ('ami-065d0be9883aa064a', '/dev/xvdby'), ('ami-065d0be9883aa064a', '/dev/xvdbz'), ('ami-065d0be9883aa064a', '/dev/xvdca'))
That's an awful lot of the same image which makes me think something buggy may be going on there.

Create a documentation outline process for testing MR in AWS

In GitLab by @kholdaway on Aug 14, 2018, 18:47

Houndigrade can be run locally using docker-compose or in AWS. Before an MR is merged, a contributor should push a docker image to a registry and test the code from AWS. This issue is a request to have that best practice documented in order to standardize development on this repository.

Don't exit early when path doesn't exist

In GitLab by @infinitewarp on Feb 19, 2019, 12:58

Summary

As a sysop investigating inspection issues, I want houndigrade to report in its results when the device path cannot be found so that I can better diagnose that problem.

Currently, the device paths are type-checked as input arguments. If the device path is not found, input validation fails, the main code is never called, and nothing is written to the results queue. This has a later side effect in cloudigrade of never scaling down the cluster because cloudigrade (currently) needs to find something in the results even if it's just an error.
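
A minimal sketch of the direction this story implies: accept the drive path as a plain string instead of a pre-validated click.Path, and record an error result for missing paths. Option names here are illustrative, not the actual cli.py signature.

import os

import click


@click.command()
@click.option(
    "--target", "targets", multiple=True, nargs=2,
    help="Image ID and device path pairs to inspect.",
)
def main(targets):
    """Inspect each target, recording errors instead of exiting early."""
    results = {"images": {}}
    for image_id, drive in targets:
        if not os.path.exists(drive):
            # Record an error result instead of letting input validation
            # fail, so something always reaches the results queue and the
            # cluster can scale down.
            results["images"][image_id] = {"error": f"{drive} not found"}
            continue
        # ...normal mount-and-inspect logic would continue here...
    click.echo(results)


if __name__ == "__main__":
    main()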

Acceptance Criteria

  • Verify that running houndigrade with an invalid inspection target results in an appropriate error for that target on the results message queue

Assumptions and Questions

As a dev, I want a set of tests to be run with a running houndigrade container on my MRs and master

In GitLab by @elijahd on Sep 4, 2018, 13:57

Summary

As a dev, I want my MRs to build the houndigrade container and run it, having it inspect some test volumes. I want the results evaluated so I can catch any regressions before I merge to master and before we test it integrated with cloudigrade.

Acceptance Criteria

  • Verify that MRs build and run a houndigrade container in some way
  • Verify that the houndigrade container inspects a set of test volumes
  • Verify that the results of the inspection are verified and reported in the job

Assumptions and Questions

  • We need to figure out how we would like to run the houndigrade container. We could run it in an ECS cluster, or we could do something directly in gitlab-ci.
  • The intent is to make it transparent how houndigrade can be run and tested and what the results are from changes to houndigrade

New product certificate for RHEL8

In GitLab by @infinitewarp on Jul 19, 2019, 12:50

Summary

As a user of RHEL8, I want houndigrade to look for 479.pem so my images will be correctly identified.

Acceptance Criteria

Assumptions and Questions

  • Based on what I found by looking at some RHEL8 AMIs, it seems that the new product certificate is 479.pem, but we are currently only looking for 69.pem.
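
A small sketch of the change this implies, checking a list of known certificate IDs instead of only 69.pem. The directory path and helper name are illustrative.

import os

# 69.pem is the cert houndigrade already checks; RHEL8 AMIs appear to ship
# 479.pem instead, per the observation above.
RHEL_PRODUCT_CERTS = ("69.pem", "479.pem")


def has_rhel_product_cert(mount_path):
    """Return True if any known RHEL product certificate is present."""
    cert_dir = os.path.join(mount_path, "etc/pki/product")
    return any(
        os.path.isfile(os.path.join(cert_dir, cert))
        for cert in RHEL_PRODUCT_CERTS
    )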

Static analysis should be truly static

In GitLab by @infinitewarp on Aug 14, 2018, 11:10

Summary

As of https://gitlab.com/cloudigrade/houndigrade/merge_requests/64, we are now chrooting to the mounted volume and running binaries that live in that volume. Based on conversations we've had outside of GitLab, the act of chrooting to the volume means we are also running the yum binary that lives in that volume. This has potentially dangerous implications because we do not know or control the binaries in the images that we are chrooting. We could unknowingly execute malicious code by running yum or any other binary that belongs to third-party disk images.

Since houndigrade is open source, it seems like it would be trivially easy to craft an image that could exploit this behavior.

Steps to Reproduce

  1. Look at the code.
  2. Observe how we use the chroot command.

Expected Result

  • houndigrade does not execute third-party binaries.

Actual Result

  • houndigrade executes third-party binaries.

Additional context

If we have to run yum, we should use our own version of the binary, make some reasonable assumptions, and attempt to direct it to standard yum.repos.d locations within the mounted volume. If we don't actually have to run yum, consider using Python's built-in configparser module to look at the files in the standard yum.repos.d locations.
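
A sketch of the configparser alternative mentioned above; it reads the repo files as plain data and never executes anything from the mounted image. The path glob and the "rhel"/"red hat" matching heuristic are assumptions.

import configparser
import glob
import os


def find_enabled_rhel_repos(mount_path):
    """Statically read yum repo definitions from the mounted volume."""
    enabled = []
    repo_glob = os.path.join(mount_path, "etc/yum.repos.d/*.repo")
    for repo_file in glob.glob(repo_glob):
        parser = configparser.ConfigParser()
        parser.read(repo_file)  # parses text only; nothing is executed
        for section in parser.sections():
            name = parser.get(section, "name", fallback="")
            if parser.get(section, "enabled", fallback="0") == "1" and (
                "rhel" in section.lower() or "red hat" in name.lower()
            ):
                enabled.append(section)
    return enabled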

Detect RHEL more comprehensively

In GitLab by @infinitewarp on Jul 20, 2018, 19:00

Summary

As a customer, I want houndigrade to more accurately and comprehensively detect RHEL and for cloudigrade to record and return houndigrade's detailed reasons so I can trust its results.

Note: Resolving this issue should include changes for both houndigrade and cloudigrade.

When we first built houndigrade, we simply looked at /etc/release files to determine if the image contains RHEL. Since then, we've learned that there are better ways to make this determination. We need to find those ways and try to implement them.

See also OnDemand Pilot Reqs: Public Cloud which references Formalize Fingerprints. As of the time of this writing, that definition includes:

  • 1 or more Red Hat signed packages from a RHEL repository
  • Product Certificate exists as /etc/pki/product/*.pem. In the (decrypted) cert contents, find ...
    • Product:
      • Name: foo
      • Foo matches “Red Hat Enterprise Linux”
  • Release file indicates RHEL
    • Follow up on consistency of naming in release file (or lack thereof)
  • Active RHEL repositories
    • Repository configuration files indicate that a stanza is active (active = 1)
      • Active repository stanza matches “Red Hat Enterprise Linux”

Implementation details of the API response payload to be determined during development. Be sure to discuss this with @cdcabrera along the way so we don't make bad decisions for frontigrade.

Acceptance Criteria

  • Verify RHEL is identified by houndigrade for each of the criteria:
    • an image with 1 or more Red Hat signed packages from a RHEL repository
    • an image with 1 or more product certificates in /etc/pki/product/*.pem that has a product name matching "Red Hat Enterprise Linux"
    • an image with a release file that indicates RHEL
    • an image with a repository with active=1 whose stanza matches "Red Hat Enterprise Linux"
    • an image that meets two or more of the above criteria
  • Verify that the result data structure sent from houndigrade's inspection includes something to indicate the presence of a successful match on each of the criteria.
  • Verify that the presence of each of these criteria is individually recorded in cloudigrade's model.
  • Verify that if any criteria are met, we indicate that the image "is RHEL".
  • Verify that if no criteria are met, we indicate that the image "is not RHEL".
  • Verify that an issue exists for tracking integration tests! (see https://gitlab.com/cloudigrade/integrade/issues/85)

Assumptions and Questions

  • How many (and which) of these criteria are critical for the pilot launch? Can or should any of these be deferred?
    • Answer: All of them unless something is impossible. The most critical is the "is a Red Hat signed package installed".
  • We are not yet concerned with the "challenging" issue in this issue's implementation. Further, issuing a "challenge" affects the whole image for this user, not specific criteria.
  • Do we need to enumerate all of the specific things we find? For example, if there are multiple certificates, do we need to track all of them?
    • Answer: No, we just track "yes" or "no" to each of the four general criteria:
      • Y/N? found a red hat signed package installed
      • Y/N? found a red hat product certificate
      • Y/N? found a red hat release file
      • Y/N? found an active red hat repository
  • The list of things we ultimately give the user will be the ways we positively identified RHEL, not including all the ways we did not find RHEL.
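
One way the per-criterion answer above could be shaped in the results payload (field names are illustrative, not the agreed API):

# Hypothetical per-image facts: one boolean per detection criterion.
inspection_facts = {
    "rhel_signed_packages_found": True,
    "rhel_product_certs_found": False,
    "rhel_release_files_found": True,
    "rhel_enabled_repos_found": False,
}

# The image "is RHEL" if any single criterion matched.
is_rhel = any(inspection_facts.values())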

Enable inspection of images using LVM

In GitLab by @elijahd on Aug 28, 2018, 14:01

Summary

As a user migrating my images from my private cloud, I want cloudigrade to be able to inspect them even if they use LVM, which is the default for RHEL. That way the custom AMIs I created on Amazon from my virtual machines get inspected and correctly identified as RHEL so I can use cloudigrade!

Acceptance Criteria

  • Verify an image using LVM with RHEL installed inside any of its logical volumes, on any of its physical volumes, is appropriately mounted and inspected by houndigrade and reports "RHEL = True".
  • Verify an image using LVM without RHEL installed inside any of its volumes is appropriately mounted and inspected by houndigrade and reports "RHEL = False".
  • Update our overview document with notes about partition/LVM format: https://gitlab.com/cloudigrade/cloudigrade/-/wikis/cloudigrade-use-case-overview-and-FAQs
  • Work with MB to write a document that outlines numerous use cases and expectations about when we should or should not report RHEL detection. @infinitewarp edit: we don't need this now
    • For example... partition table vs LVM, multiple partitions, multiple volumes, no partition table, unreadable filesystem format, mix of volumes with one RHEL and others not RHEL, etc etc etc.
    • Be sure to note in this document how houndigrade's findings are only part of the equation. Remember that cloudigrade knows things outside of houndigrade's scope such as tags, cloud access, marketplace, etc. Enumerate these too.
    • This document might be useful if formatted like a decision tree. "If X, then Y. If Y, then Z. etc."
  • Verify QE has built various images using LVM and included them in automated IQE integration tests.
  • Verify cloud.redhat.com docs for Subscription Watch for Cloud Meter have been updated to remove qualifiers about LVM support and instead (maybe?) express that we support both LVM and traditional partition tables in AWS AMIs.

Assumptions and Questions

  • We should assume no knowledge about the disk formats outside of houndigrade. houndigrade itself should look at the device and make an informed decision about how to mount and interact with its filesystems regardless of LVM vs plain old partition tables. houndigrade should attempt to mount and inspect every possible partition on the device.
  • If old partition table, inspect all partitions. If LVM, inspect all volumes.
  • Assume that all the physical volumes for LVM are in the same image.
  • When we review our docs for public users, please remember to add some kind of message to encourage users to use tags to identify RHEL! That would probably help cover other use cases beyond the "naive solution" we are implementing for LVM here.
    • Further point about tags: Let's have a conversation with product people about syspurpose, etc. Can we get the user to put more things in tags? Not just a binary true/false of presence indication? TODO Maybe raise this in a new issue for spike/story.
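
A rough sketch of the naive LVM handling described above, shelling out to standard LVM tools; the exact scan/activate sequence and output handling are assumptions, not houndigrade's current behavior.

import json
import subprocess


def activate_and_list_volumes():
    """Activate any LVM volume groups and list block devices to inspect."""
    # Make LVM aware of newly attached physical volumes, then activate all
    # volume groups so their logical volumes appear as mountable devices.
    subprocess.run(["vgscan"], check=True)
    subprocess.run(["vgchange", "-ay"], check=True)

    # lsblk reports both plain partitions and activated logical volumes,
    # so one mount-and-inspect loop can cover either layout.
    lsblk = subprocess.run(
        ["lsblk", "--json", "-o", "NAME,TYPE,FSTYPE"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(lsblk.stdout)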

Turn Houndigrade Logging up to 11

In GitLab by @katherine-black on Sep 10, 2020, 10:26

Summary

As a cloudigrade dev, I want houndigrade to turn logging up to 11, so I'm always ready for whatever new issues it throws at us.

Acceptance Criteria

  • Verify we capture more information about the state of the attached devices and potential formats and filesystems on them.
  • Whoever picks this up, share knowledge of the commands we're going to use with the team so we all have a better understanding of ways to diagnose filesystem curiosities.
    • Write a wiki page (or something?) summarizing this new knowledge and the tools we use, with examples of how to use them for troubleshooting.
  • Talk about this change and the programs it calls during our sprint demo.
  • QE to manually verify that these new logs exist and to familiarize with how this works. QE will likely not fully automate reading and checking the CloudWatch logs for houndigrade with this issue, but maybe later.

Assumptions and Questions

  • Can we capture FS information? Maybe run these commands and dump their output to stdout so we can see them in the logs.
    • lsblk?
    • fdisk -l?
  • If there are other handy command-line programs we can use, research and note them here.
  • We accept that this story is vaguely defined. We just want to capture more data to make our investigations easier!
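
A minimal sketch of the extra logging this asks for, dumping the commands mentioned above to stdout so the output lands in CloudWatch; the exact column list is an assumption.

import subprocess

DIAGNOSTIC_COMMANDS = (
    ["lsblk", "-o", "NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT"],
    ["fdisk", "-l"],
)


def log_device_state():
    """Print raw device and filesystem details for later troubleshooting."""
    for command in DIAGNOSTIC_COMMANDS:
        result = subprocess.run(command, capture_output=True, text=True)
        print(f"$ {' '.join(command)}")
        print(result.stdout or result.stderr)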

Update our EC2 launch configs for ECS to use latest AMI for modern agent

In GitLab by @infinitewarp on Aug 12, 2020, 13:42

Important note!

This work is effectively blocked on cloudigrade/cloudigrade#738 because 738's implementation might handle this issue "for free".

Summary

As a cloudigrade sysop, I want my houndigrade ECS cluster EC2 launch configuration to use modern versions of available ECS AMIs so that I am more confident about having a supported and working Docker agent for ECS.

In our recent investigations, we discovered that most (all?) of our existing ECS setups were using very old AMIs, and they presented warnings in the AWS web console that they were not running recent versions of their agents. We suspected this could contribute to some of our mysteriously failing inspection tasks. This means we want to periodically get the latest supported ECS EC2 AMI and update our launch definitions to use that latest AMI.

To perform this update, we should consider implementing an independent scheduled process that is responsible for checking for recent AMIs and updating the various things in AWS to make ECS use that latest AMI. Consider making this the responsibility of the "reaper" in GitLab CI. Consider renaming "reaper" to something more appropriate like "janitor"?

Let's do this with a GitLab CI pipeline that behaves like our old cloudigrade release pipeline:

  • CI and QA automatically get "latest"
    • updating stage is a manual step that takes the version used in CI/QA
      • updating prod is a manual step that takes the version used in stage
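
A sketch of how the periodic job described above could discover the latest supported ECS AMI. The SSM parameter path shown is the one AWS documents for the ECS-optimized Amazon Linux 2 AMI, but treat it (and the rest of this snippet) as an assumption to verify.

import boto3

# AWS publishes the recommended ECS-optimized AMI as a public SSM parameter.
ECS_AMI_PARAMETER = (
    "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
)


def latest_ecs_ami_id(region):
    """Return the AMI ID a new Launch Configuration should use."""
    ssm = boto3.client("ssm", region_name=region)
    return ssm.get_parameter(Name=ECS_AMI_PARAMETER)["Parameter"]["Value"]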

Acceptance Criteria

  • Verify a periodic job exists to perform the following actions.
  • Verify we have new AWS EC2 Launch Configuration with the newly chosen AMI ID.
  • Verify we have updated the AWS EC2 Auto Scaling group with the new Launch Configuration's ID.
  • Verify that houndigrade actually runs through the normal inspection process.
  • Verify the job applies changes to:
    • prod ECS cluster
    • stage ECS cluster
    • qa ECS cluster
    • ci ECS cluster
    • NOT each of our personal ECS clusters
  • QE to add some test case to verify that the updated reaper/janitor actually gets a recent AMI ID.
    • This might mean to check and compare two things: the current latest AMI ID from AWS ECS and the currently configured value assigned to the Launch Configuration.
    • This is not a fully-automated test. This is a one-time verification check that may be manual if that's easier.
  • QE to manually run the IQE test suite at least once to verify inspection still works after at least one run of the new reaper/janitor update process.
    • This is not a fully-automated test. This is a one-time verification check that may be manual if that's easier.

Assumptions and Questions

  • Keep in mind that we may need to manually coordinate config changes in app-interface if the Auto Scaling group name changes!

Looking for 69.pem in all the wrong places

In GitLab by @infinitewarp on Feb 22, 2019, 18:57

Summary

houndigrade looks for 69.pem in /etc/pki/product/, but sometimes that file lives elsewhere.

Steps to Reproduce

  1. start an EC2 instance with RHEL7 from the marketplace
  2. ssh into the running instance
  3. sudo find /etc/pki -name 69.pem

Expected Result

  • /etc/pki/product/69.pem

Actual Result

  • /etc/pki/product-default/69.pem

Additional context

/etc/pki/product/ is the path we were given in #34.

houndigrade only looks in /etc/pki/product/ for 69.pem which means if houndigrade were to inspect an image set up like this marketplace image, it would not identify as RHEL-positive for product certs.

Yes, I'm aware that we don't inspect marketplace images. However, if this image has 69.pem outside our expected path, I think it's likely that other customer installations do as well.

Perhaps we should just look recursively under /etc/pki/.

QE says adding unit tests should be sufficient for verification. No need to expand the integration test suite.
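
A minimal sketch of the recursive search suggested above (the helper name is illustrative):

import glob
import os


def find_product_cert(mount_path, cert_name="69.pem"):
    """Search recursively under /etc/pki for the given product certificate."""
    pattern = os.path.join(mount_path, "etc/pki", "**", cert_name)
    # recursive=True lets ** match product/, product-default/, or any other
    # nesting a customer image might use.
    matches = glob.glob(pattern, recursive=True)
    return matches[0] if matches else None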

Detect RHEL by a standard AWS image tag

In GitLab by @infinitewarp on Dec 23, 2019, 13:26

Summary

As an AWS user, I want cloudigrade to detect RHEL by presence of a tag so that I don't have to grant access to my base image.

Acceptance Criteria

  • Verify images with a standard tag key/value are recorded as having RHEL and do not proceed to copy for houndigrade inspection.
  • Verify images without a standard tag key/value are not recorded as having RHEL and do proceed to copy for houndigrade inspection.

Assumptions and Questions

  • What key/value should we standardize for expecting RHEL presence?
    • We should use: cloudigrade-rhel-present
  • Should we also have a key/value for expecting RHEL absence?
    • Not for now; let's reconsider this if/when we also implement absence checking for OCP tags.
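
A sketch of how cloudigrade might check for the proposed tag before queuing an image for houndigrade inspection; the tag key comes from the answer above, but the describe_images-based lookup is only an assumption.

import boto3

RHEL_TAG_KEY = "cloudigrade-rhel-present"  # key proposed in this issue


def image_has_rhel_tag(image_id, region):
    """Return True if the AMI carries the standard RHEL-present tag."""
    ec2 = boto3.client("ec2", region_name=region)
    images = ec2.describe_images(ImageIds=[image_id]).get("Images", [])
    if not images:
        return False
    tags = {tag["Key"]: tag["Value"] for tag in images[0].get("Tags", [])}
    return RHEL_TAG_KEY in tags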

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Inspection of multiple i386 images does not detect RHEL

In GitLab by @mpierce on Jul 29, 2020, 17:21

Summary

When I

  • run one instance each of two different i386 images that should both be identified as RHEL,
  • then create a source and thereby a Cloud Meter account,

Houndigrade's inspection process doesn't recognize my instances as RHEL

Steps to Reproduce

  1. Launch an instance each of the below two images
  2. Create a Source, Cloud Meter account
  3. Wait for inspection process to complete
  4. Check facts returned by Cloud Meter about each instance

Image details:

Image(
    'ami-02c0a30866bcb43a8',
    'i386',
    't1.micro',
    'premium',
    'Red Hat Enterprise Linux Server',
    'i386-premium-cmTest1'
),
Image(
    'ami-0e30b17e75753b989',
    'i386',
    't1.micro',
    'premium',
    'Red Hat Enterprise Linux Console',
    'i386-standard-cmTest2'
),

The above images each have a /etc/rhsm/syspurpose/syspurpose.json file with content similar to:

{
  "role": "Red Hat Enterprise Linux Server",
  "service_level_agreement": "premium",
  "usage": "test"
}

And they each have a /etc/redhat-release file containing:

Red Hat Enterprise Linux Server release 6.5 (Santiago)

Expected Result

  • Both instances are recognized as RHEL by Cloud Meter and show up in the data returned by the Concurrent API

Actual Result

  • Both instances are marked RHEL: False and syspurpose: None

Additional context

This happens in the QA environment and hasn't been tested in other environments.
The AWS account used here is DEV07.
I gathered details about images, instances and the account which can be found in attached files.

888950.txt

888951.txt

Minimize test data

In GitLab by @infinitewarp on Aug 14, 2018, 10:44

Summary

As a developer, I don't want to have to manually find and download gigabytes of images in order to test houndigrade locally. Test data should be minimally small and easily accessible in an automated fashion.

Acceptance Criteria

  • Verify test data is quickly retrieved via regular commands in a local command line shell
    • Hint hint: probably a git submodule

Assumptions and Questions

docker-compose up fails

Errors:

app_1  | Provided cloud: aws
app_1  | Provided drive(s) to inspect: (('ami-redhatami', '/dev/loop7'), ('ami-centosami', '/dev/loop8'), ('ami-both', '/dev/loop9'))
app_1  | Checking drive /dev/loop7
app_1  | Checking partition /dev/loop7p1
app_1  | RHEL found on: /dev/loop7p1
app_1  | RHEL found on: /dev/loop7p1
app_1  | Checking drive /dev/loop8
app_1  | Checking partition /dev/loop8p1
app_1  | RHEL not found on: /dev/loop8p1
app_1  | RHEL not found on: /dev/loop8p1
app_1  | RHEL not found on: /dev/loop8p1
app_1  | Checking drive /dev/loop9
app_1  | Checking partition /dev/loop9p1
app_1  | RHEL not found on: /dev/loop9p1
app_1  | RHEL not found on: /dev/loop9p1
app_1  | RHEL found on: /dev/loop9p1
app_1  | Traceback (most recent call last):
app_1  |   File "cli.py", line 250, in <module>
app_1  |     main()
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/click/core.py", line 722, in __call__
app_1  |     return self.main(*args, **kwargs)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/click/core.py", line 697, in main
app_1  |     rv = self.invoke(ctx)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/click/core.py", line 895, in invoke
app_1  |     return ctx.invoke(self.callback, **ctx.params)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/click/core.py", line 535, in invoke
app_1  |     return callback(*args, **kwargs)
app_1  |   File "cli.py", line 58, in main
app_1  |     report_results(results)
app_1  |   File "cli.py", line 156, in report_results
app_1  |     queue_url = _get_sqs_queue_url(queue_name)
app_1  |   File "cli.py", line 137, in _get_sqs_queue_url
app_1  |     sqs = boto3.client('sqs')
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/boto3/__init__.py", line 83, in client
app_1  |     return _get_default_session().client(*args, **kwargs)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/boto3/session.py", line 263, in client
app_1  |     aws_session_token=aws_session_token, config=config)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/session.py", line 861, in create_client
app_1  |     client_config=config, api_version=api_version)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/client.py", line 76, in create_client
app_1  |     verify, credentials, scoped_config, client_config, endpoint_bridge)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/client.py", line 285, in _get_client_args
app_1  |     verify, credentials, scoped_config, client_config, endpoint_bridge)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/args.py", line 45, in get_client_args
app_1  |     endpoint_url, is_secure, scoped_config)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/args.py", line 111, in compute_client_args
app_1  |     service_name, region_name, endpoint_url, is_secure)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/client.py", line 358, in resolve
app_1  |     service_name, region_name)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/regions.py", line 122, in construct_endpoint
app_1  |     partition, service_name, region_name)
app_1  |   File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
app_1  |     raise NoRegionError()
app_1  | botocore.exceptions.NoRegionError: You must specify a region.
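
The traceback ends in botocore's NoRegionError, which boto3 raises when no AWS region is configured in the container. A hedged sketch of the local fix, which is to export AWS_DEFAULT_REGION into the docker-compose environment or pass region_name explicitly (the region value below is only an example):

import os

import boto3

# boto3 resolves the region from its normal config chain; for local
# docker-compose runs the simplest option is setting AWS_DEFAULT_REGION
# in the container environment or passing region_name explicitly.
region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
sqs = boto3.client("sqs", region_name=region)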

Capture houndigrade stdout logs

In GitLab by @infinitewarp on Mar 20, 2019, 13:58

Summary

As a houndigrade user, I want to see output from houndigrade runs in AWS so I can diagnose potentially problematic behaviors.

Put logs in Sentry? No. Or maybe AWS CloudWatch? Probably yes.

We want this to work everywhere, not just in prod-like environments.

Acceptance Criteria

  • Verify that we can see houndigrade's output
  • Verify this solution works in any AWS environment, not just prod-like

Assumptions and Questions
