taskcluster-worker-checker's Issues

Pull machine data (such as ilo:port) from Google Sheet

Currently, whenever we work on a Windows or Linux machine and need to access it via iLO, we have to look up the connection information in the Google Sheet (Moonshot Master Inventory) before continuing our work.

This is inefficient and error-prone. Pulling the data automatically would fix that and also give us extra information about a machine when we need it, such as:

Chassis
Cartridge Number
ILO:PORT
Machine Notes
and anything else we have inside the tracking sheet.
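A minimal sketch of the lookup, assuming the sheet arrives as a list of rows with the header row first (the shape a Google Sheets value range is usually read into). The column names, the `index_inventory` helper and the sample row are hypothetical and not taken from the real inventory sheet.

```python
def index_inventory(rows, key_column="Hostname"):
    """Build {hostname: {column: value}} from sheet rows (header row first)."""
    header, *data = rows
    inventory = {}
    for row in data:
        # Rows may come back shorter than the header when trailing cells
        # are empty, so pad them before zipping.
        padded = row + [""] * (len(header) - len(row))
        record = dict(zip(header, padded))
        inventory[record[key_column]] = record
    return inventory


# Hypothetical sample data (assumption, for illustration only).
rows = [
    ["Hostname", "Chassis", "Cartridge Number", "ILO:PORT", "Machine Notes"],
    ["t-linux64-ms-001", "1", "1", "ilo.example:443"],
]
inventory = index_inventory(rows)
print(inventory["t-linux64-ms-001"]["ILO:PORT"])
```

With an index like this, any extra column in the tracking sheet becomes available under its header name without further code changes.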

Doubled entries for OSX

ISSUE: doubled entries, only for OSX machines
When: this happens when running

python client.py -v -l 2 or python client.py -v -l 1 ... etc
and selecting option 11 at the menu.
This doesn't occur with menu selection 14.

Add missing licenses

We need to add Mozilla's License and QT license to the repository code.

Fix Failed JSON Responses

Whenever we load, read and decode the JSON response from TaskCluster, the workers list sometimes comes back empty.
Currently we wrap the call in a try/except, which doesn't fix the underlying issue, but at least we get the data after a few retries.
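The retry approach could be made explicit along these lines. This is only a sketch: `fetch` is a hypothetical callable that returns the raw response body as text, and the `"workers"` key is assumed from the description above rather than checked against the TaskCluster API.

```python
import json
import time


def load_workers(fetch, retries=5, delay=1.0):
    """Call fetch() until the decoded response has a non-empty 'workers' list."""
    for attempt in range(1, retries + 1):
        try:
            workers = json.loads(fetch()).get("workers", [])
        except (ValueError, OSError):
            # Invalid JSON or a network error: treat it as an empty result.
            workers = []
        if workers:
            return workers
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError("no workers returned after {} attempts".format(retries))
```

Bounding the number of attempts and raising at the end makes a persistent failure visible instead of looping silently.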

[Tracking] TWC 1.0 release

Introduction

With TaskCluster Worker Checker growing to be more than just a tool used by CiDuty, we have to rethink how we handle logic, defs, code and many areas of the tool.

We also have a lot of areas that we can now remove and retire. Below will be a rough outline of what we need to do:

Where?

All work will be done on branch twc1.0

ToDo

Finish implementation of GoogleAPI
Remove machines_to_ignore = {...dictionary...}
Rewrite parse_taskcluster_json()
Remove generate_machine_lists()
Put Google Auth login in a def
Rewrite def main() and break it into smaller defs
Implement UnitTesting
When ready to release, Lock prior master code under version tag 0.9

Grab machines from ServiceNow

Currently we are generating a hard-coded list based on ranges that could change at any time!
We should use Mozilla's ServiceNow API to grab all the machines we are interested in.

What to pay attention to:

The "magic" where we set() our 2 diffs expects a list with the machine name as: t-linux64-ms-280
We print a quick ssh command and do MDC logic based on generate_machine_lists(workertype) -> global mdc1,mdc2. We don't want to break this, but we can also look at better ways to handle this.

Implement Loaner Machines Exclusion

We currently don't exclude the loaner/dev machines in the final output.

This should be an easy first fix!
Simply populate ignore_ms_* with the machines we don't need; we can do our look-up/pop before a = set(workersList) happens.
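A small sketch of that filtering step. The `exclude_ignored` helper and the sample machine names are hypothetical; only `workersList` and the `ignore_ms_*` naming come from the description above.

```python
def exclude_ignored(workers_list, ignored):
    """Drop ignored machines, preserving order, before the set comparison."""
    ignored = set(ignored)
    return [name for name in workers_list if name not in ignored]


ignore_ms_linux = ["t-linux64-ms-002"]  # hypothetical loaner machine
workersList = ["t-linux64-ms-001", "t-linux64-ms-002", "t-linux64-ms-003"]

a = set(exclude_ignored(workersList, ignore_ms_linux))
print(sorted(a))
```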

Over-engineered expression

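Doing this:
python3 client.py -w WORKER_TYPE -u LDAP_USERNAME | cat >> missing.txt

can be replaced with:
python3 client.py -w WORKER_TYPE -u LDAP_USERNAME >> missing.txt

The use of "| cat >>" is a bit over-engineered: the shell can append to the file directly with ">>" (or overwrite it with ">" if a fresh file is wanted on every run).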

Dictionaries instead of variables.

With the latest commit 5827b3f, I introduced a lot of new variables and added complexity to how we remove the machines that shouldn't be shown in the final output.
My implementation is also ugly and doesn't really offer much information.

Proposed changes

Instead of using windows_pxe, linux_pxe, osx_other_problems, etc. as standalone variables, experiment with dictionaries, since they can hold more information.

Data structure

problem_machines = {
    "linux": {
        "pxe_issues": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
        "hdd_issues": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
        "other_issues": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
        "loaner": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
    },
    "windows": {
        "pxe_issues": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
        "hdd_issues": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
        "other_issues": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
        "loaner": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
    },
}

What will this change fix:

This will let us use code that is much cleaner and easier to use. One simple dictionary is easier to maintain and can store much more useful information, such as BUG IDs, via keys and values.

This change is up for grabs, but will require quite some changes around the script.
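As a rough illustration of how the nested dictionary could be consumed, the sketch below collects every problem machine for one platform together with its categories and bug IDs. The `machines_to_ignore` function name is hypothetical, and the data is a shortened copy of the proposed structure.

```python
def machines_to_ignore(problem_machines, platform):
    """Return {machine: [(category, bug_id), ...]} for one platform."""
    result = {}
    for category, machines in problem_machines.get(platform, {}).items():
        for name, bug_id in machines.items():
            result.setdefault(name, []).append((category, bug_id))
    return result


# Shortened copy of the proposed structure, with placeholder bug IDs.
problem_machines = {
    "linux": {
        "pxe_issues": {"t-linux64-ms-001": "BUG ID HERE"},
        "loaner": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE",
        },
    },
}

ignored = machines_to_ignore(problem_machines, "linux")
for name, reasons in sorted(ignored.items()):
    print(name, reasons)
```

The keys of the result can be subtracted from the list of missing machines, while the values keep the reason and bug ID available for the final output.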

Fix MacOSX ssh command.

In the logic at line 168 we generate the FQDN incorrectly.

Current Output:
ssh [email protected]

Expected Output:
ssh [email protected]

[Core] User customization via userSettings.json

A lot of hardcoded values and default variable values can be moved to a user_settings.json file.
This will give users the ability to customize TWC to accept their own PC configuration.

ToDo:

Create and generate the structure of user_settings.json
Store iLO password and iLO path in an encrypted format, using the PC's unique ID as salt-password
Ability to change iLO password, with integrity check.
Ability to change iLO installation path, with integrity check.
Ability to change click coordinates per display.
Rename configuration.py to run_flags.py to avoid confusion.
Ability to reset the whole user_settings.json back to default values.

[AutoReboot] Implement screen reading to detect connection.

Using pygui.locateOnScreen("screenshot", confidence=1.0) we can detect the iLO app and then look in the top left corner of the screen for the " _ " list that tells us we are successfully connected to it.

This will replace the current approach, which isn't very pretty: after clicking the connect button we wait a fixed 5-10 seconds and hope that the connection is established within that time.
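One way to replace the fixed wait is a small polling helper, sketched below. `wait_until` is a hypothetical name; the predicate passed to it would wrap the screen-detection call. Depending on the PyAutoGUI version, a failed match may return None or raise an exception, so the predicate would need to handle both cases.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns the truthy value on success and None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

The caller can then proceed as soon as the connection indicator appears, and treat a None result as a failed connection instead of clicking blindly.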

Add some kind of ping function

We could add a function that pings hosts that have been idle for more than 6 hours and returns the results. What's your opinion? Would this be useful?
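A possible shape for such a function, as a sketch only. It assumes the idle time per host is already available as a dictionary of hours, and it shells out to the system `ping` command; `idle_hosts` and `ping` are hypothetical names.

```python
import platform
import subprocess


def idle_hosts(idle_hours_by_host, threshold=6):
    """Return hosts idle for more than `threshold` hours, sorted by name."""
    return sorted(h for h, hours in idle_hours_by_host.items() if hours > threshold)


def ping(host, timeout=5):
    """Send one echo request via the system ping; True if the exit code is 0."""
    # Windows uses -n for the packet count, Unix-like systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        completed = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0
```

A caller could then build the report with something like `{host: ping(host) for host in idle_hosts(idle_hours)}`.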

Modify the script to handle the new Windows Moonshots from MDC2

At the moment a new set of ~30 machines has been introduced to the pool (23 machines have been successfully re-imaged and more new machines are on their way).

[Core] Create a function to generate AES+Salt passwords

This issue will help progress #133

Implementation idea:

Use cryptography to generate the password
Use the PC's unique ID as the salt password.
Store it in its encrypted format in the JSON file.
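The key-derivation half of this idea can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `derive_key` name, the use of PBKDF2-HMAC-SHA256, the iteration count, and `uuid.getnode()` as a stand-in for the PC's unique ID. The resulting 32-byte, URL-safe base64 key is the format that the cryptography package's Fernet class accepts, so the actual encryption step could be built on top of it.

```python
import base64
import hashlib
import uuid


def derive_key(master_password, machine_id, iterations=200_000):
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256, salted by the machine ID."""
    raw = hashlib.pbkdf2_hmac(
        "sha256",
        master_password.encode("utf-8"),
        machine_id.encode("utf-8"),
        iterations,
    )
    # URL-safe base64 of 32 raw bytes, the key format Fernet expects.
    return base64.urlsafe_b64encode(raw)


machine_id = str(uuid.getnode())  # stand-in for the PC's unique ID (assumption)
key = derive_key("example-password", machine_id)
```

Because the salt comes from the machine, the same settings file copied to another PC would derive a different key and fail to decrypt.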

Add some kind of parallel-ssh-like code for yosemite/linux

Would be nice to have an integrated script for running commands (like reboot) on known machines that our current script returned as being problematic.

I have been working on a piece of code to do so, but I got stuck at the point where we have to provide our DUO for logging in.

I've asked Aki for help. It's good to have it tracked here as well, so anybody can come up with ideas.

For now we have the following:

import getpass

import paramiko

ssh = paramiko.SSHClient()

# Ask for the passphrase that unlocks the private key (input is not echoed).
passPhrase = getpass.getpass("What's your passphrase for private key? ")

# Automatically accept host keys that are not yet known.
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Locate the private key and unlock it.
privkey = paramiko.RSAKey.from_private_key_file('/home/rolandmutter/.ssh/id_rsa', password=passPhrase)

# Connect to the host, which is hardcoded for now for test purposes.
ssh.connect('t-yosemite-r7-230.test.releng.mdc2.mozilla.com', username='root', pkey=privkey)

#paramiko.util.log_to_file(os.path.expanduser('/home/rolandmutter/paramiko.log'), logging.DEBUG)

# Command to execute, again hardcoded.
stdin, stdout, stderr = ssh.exec_command('ls')

# Print what the command wrote to stdout on the remote machine.
print(stdout.readlines())

ssh.close()  # close the connection

Generate the 'json_data/verbose_google_dict.json' earlier

New clones need to be run with -v the first time; otherwise we get a "No such file or directory" error.
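One way to avoid the error is to create the file on start-up when it is missing. The sketch below assumes a hypothetical `ensure_json_file` helper and a `build_default` callable that produces the initial content; how the real dictionary is generated is not shown here.

```python
import json
import os


def ensure_json_file(path, build_default):
    """Create path (and its parent directories) from build_default() if missing.

    Returns True if the file was created, False if it already existed.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as handle:
        json.dump(build_default(), handle, indent=2)
    return True
```

Calling this before the first read, for example with the 'json_data/verbose_google_dict.json' path, would make a fresh clone behave the same with or without -v.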

[GSheet] Keep track of the number of actions taken

In the master inventory sheet, we have a column named CiDuty CLI # of Actions Taken. This column should show how many times TWC has taken action on a given machine.

For now the only action that we can take is an automated reboot, but in the future we are looking at ways to implement automated re-imaging as well.

Workflow:

Get List of Machines
Get current CiDuty CLI # of Actions Taken
Take action
If the action is successful, update int(CiDuty CLI # of Actions Taken) + 1
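The workflow above can be sketched as follows. This is only an outline: `record_action` and `take_action` are hypothetical names, the counter column is modelled as a plain dictionary, and writing the value back to the sheet is left out.

```python
def record_action(machine, actions_taken, take_action):
    """Run take_action(machine); bump the counter only if it succeeded."""
    # Sheet cells arrive as text and may be empty, so normalise to an int.
    current = int(actions_taken.get(machine) or 0)
    if take_action(machine):
        actions_taken[machine] = current + 1
        return True
    return False


# Hypothetical counter values as they might be read from the sheet.
counts = {"t-linux64-ms-001": "2", "t-linux64-ms-002": ""}
record_action("t-linux64-ms-001", counts, lambda machine: True)
print(counts["t-linux64-ms-001"])
```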

Make the script use the correct DC lists

While trying to modify the script to accept the new machines, I noticed that at the moment we are using "mdc2_range" from linux for windows as well, thus providing misleading information.

Use case:
print("Total of missing server : {}".format(len(missing_machines) - len(mdc2_range)))

I also tried to correct this, but for some reason the "mdc_range" generated for windows is not accessible.

Filter Table taking Default values into consideration

This issue is for the GUI branch.

Currently we show all the machines if we don't have IDLE and/or Ignored? selected.
By default the filter logic should go as described below, even if nothing is selected (IDLE/Ignored) in the UI:

Is Machine Ignored? Yes? Don't show it in the table. No? Show it.
Is the machine IDLE for 6 Hours or more? Yes? Show it in the table. No? Don't show it.

Basically we are missing "default values", which will be overridden by the custom input that the user sets in the UI.
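The default logic could look roughly like this. The filter keys, the `should_show` name and the machine record fields are hypothetical; only the rules (hide ignored machines, show machines idle for 6 hours or more) come from the description above.

```python
# Defaults applied when nothing is selected in the UI (names are assumptions).
DEFAULT_FILTERS = {"show_ignored": False, "min_idle_hours": 6}


def should_show(machine, user_filters=None):
    """Decide whether a machine row belongs in the table."""
    # User input overrides the defaults key by key.
    filters = {**DEFAULT_FILTERS, **(user_filters or {})}
    if machine["ignored"] and not filters["show_ignored"]:
        return False
    return machine["idle_hours"] >= filters["min_idle_hours"]


print(should_show({"ignored": False, "idle_hours": 8}))
```

Merging the user's choices over a defaults dictionary keeps the "nothing selected" case and the customised case on the same code path.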

[Core] Move imports from client.py to py setuptools

This issue will help us streamline the process of installing TWC.

No entries while checking status on only one type of machines

ISSUE: No entries while checking status on a single type of machine
When: This occurs only while checking the status of a single type of machine, Windows or Linux.
It doesn't occur with OSX machines.

danlabici / taskcluster-worker-checker
