
taskcluster-worker-checker's People

Contributors

atcraciun, bccrisan, bolchisb, danlabici, escapewindow, mutterroland, popadrianc, raduiman-zz, zsoltfay

taskcluster-worker-checker's Issues

Pull machine data (such as ilo:port) from Google Sheet

Currently, when we work with Windows and/or Linux machines and need to access one via ILO, we always have to look up the connection information in the Google Sheet (Moonshot Master Inventory) before continuing our work as usual.

This is inefficient and error-prone. This request will fix that, and will also give us extra information about a machine when we need it, such as:

  • Chassis
  • Cartridge Number
  • ILO:PORT
  • Machine Notes
  • and anything else we have inside the tracking sheet.
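A minimal sketch of the lookup, assuming the sheet has already been fetched as a list of rows with the header row first (the shape a values range typically comes back in from the Google Sheets API). The column names, the `index_machines` helper and the sample row are hypothetical placeholders, not the actual layout of the Moonshot Master Inventory.

```python
def index_machines(rows, key_column="Hostname"):
    """Map each machine name to a dict of its sheet columns.

    `rows` is a list of lists whose first row is the header.
    """
    header, *data = rows
    machines = {}
    for row in data:
        # Trailing empty cells may be missing from a row, so pad short rows.
        padded = list(row) + [""] * (len(header) - len(row))
        record = dict(zip(header, padded))
        name = record.get(key_column, "")
        if name:
            machines[name] = record
    return machines


# Hypothetical sample data standing in for the real sheet.
sample_rows = [
    ["Hostname", "Chassis", "Cartridge", "ILO:PORT", "Notes"],
    ["t-linux64-ms-001", "1", "1", "ilo.example:443"],
]
inventory = index_machines(sample_rows)
print(inventory["t-linux64-ms-001"]["ILO:PORT"])
```

Keying the result by machine name means the rest of the tool can ask for a single machine's details without scanning the sheet again.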

Doubled entries for OSX

Issue: doubled entries appear, but only for OSX machines.
When: this happens when running

python client.py -v -l 2 or python client.py -v -l 1 ... etc.
and selecting 11 at the menu.
This doesn't occur with menu selection 14.

Fix Failed JSON Responses

When we load, read and decode the JSON response from TaskCluster, the workers list sometimes comes back empty.
Currently we wrap the call in a try/except, which doesn't fix the underlying issue, but at least we get the data after a few retries.
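The retry workaround could be made explicit with a bounded retry loop. This is only a sketch under assumptions: `fetch` is a hypothetical callable that returns the raw response text, and the response is assumed to be a JSON object with a "workers" key. It does not address why the response is empty in the first place.

```python
import json
import time


def load_workers(fetch, retries=5, delay=1.0):
    """Call `fetch()` until the decoded JSON has a non-empty "workers" list."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            data = json.loads(fetch())
            workers = data.get("workers")
            if workers:
                return workers
            last_error = ValueError("response contained no workers")
        except (ValueError, AttributeError) as error:
            # ValueError covers malformed JSON; AttributeError covers a
            # response that decodes to something other than an object.
            last_error = error
        if attempt < retries:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    raise RuntimeError(f"no usable response after {retries} attempts") from last_error
```

Raising after the last attempt, instead of returning an empty list, keeps a persistent failure from being mistaken for "no workers".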

[Tracking] TWC 1.0 release

Introduction

With TaskCluster Worker Checker growing to be more than just a tool used by CiDuty, we have to rethink how we handle logic, defs, code and many areas of the tool.

We also have a lot of areas that we can now remove and retire. Below will be a rough outline of what we need to do:

Where?

All work will be done on branch twc1.0

ToDo

  • Finish implementation of GoogleAPI
  • Remove machines_to_ignore = {...dictionary...}
  • Rewrite parse_taskcluster_json()
  • Remove generate_machine_lists()
  • Put Google Auth login in a def
  • Rewrite def main() and break it into smaller defs
  • Implement UnitTesting
  • When ready to release, Lock prior master code under version tag 0.9

Grab machines from ServiceNow

Currently we are generating a hard-coded list based on ranges that could change at any time!
We should be using Mozilla's ServiceNow API to grab all the machines we are interested in.

What to pay attention to:

  • The "magic" where we set() our 2 diffs expects a list with the machine name as: t-linux64-ms-280
  • We print a quick ssh command and do MDC logic based on generate_machine_lists(workertype) -> global mdc1,mdc2. We don't want to break this, but we can also look at better ways to handle this.

Implement Loaner Machines Exclusion

We currently don't exclude the loaner/dev machines in the final output.

This should be an easy first fix!
Simply populate ignore_ms_* with the machines we don't need; we can then do our look-up/pop before a = set(workersList) happens.
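A minimal sketch of that exclusion step. The `exclude_ignored` helper, the concrete name `ignore_ms_linux` and the machine names are hypothetical; only `workersList` and the `a = set(...)` step come from the description above.

```python
def exclude_ignored(workers_list, ignored):
    """Return the worker names with ignored (loaner/dev) machines removed."""
    ignored = set(ignored)
    return [name for name in workers_list if name not in ignored]


# Hypothetical data: one loaner machine that should not show up in the output.
ignore_ms_linux = ["t-linux64-ms-240"]
workersList = ["t-linux64-ms-001", "t-linux64-ms-240", "t-linux64-ms-280"]

a = set(exclude_ignored(workersList, ignore_ms_linux))
print(sorted(a))
```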

Over-engineered expression

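Doing this:
python3 client.py -w WORKER_TYPE -u LDAP_USERNAME | cat >> missing.txt

can be replaced with:
python3 client.py -w WORKER_TYPE -u LDAP_USERNAME >> missing.txt

The use of "| cat >>" is a bit over-engineered. Note that ">>" appends to missing.txt, as the original command does; use ">" only if the file should be overwritten on each run.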

Dictionaries instead of variables.

With the latest commit, 5827b3f, I introduced a lot of new variables and added complexity to how we remove the machines that shouldn't be shown in the final output.
My implementation is also ugly and doesn't really offer much information.

Proposed changes

Instead of using windows_pxe, linux_pxe, osx_other_problems, etc. as standalone variables, experiment with dictionaries, since they can hold more information.

Data structure

problem_machines = {
    "linux": {
        "pxe_issues": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
        "hdd_issues": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
        "other_issues": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
        "loaner": {
            "t-linux64-ms-001": "BUG ID HERE",
            "t-linux64-ms-002": "BUG ID HERE"
        },
    },
    "windows": {
        "pxe_issues": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
        "hdd_issues": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
        "other_issues": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
        "loaner": {
            "t-w1064-ms-001": "BUG ID HERE",
            "t-w1064-ms-002": "BUG ID HERE"
        },
    },
}

What will this change fix:

This will let us use code that is much cleaner and easier to use. One simple dictionary is easier to maintain and can store much more useful information, such as BUG IDs, via keys and values.
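As a rough illustration of how such a dictionary might be consumed, the sketch below flattens one platform's entries into a single look-up. The helper name `machines_to_skip` and the trimmed-down sample data are hypothetical.

```python
# Trimmed-down sample in the proposed shape; bug IDs are placeholders.
problem_machines = {
    "linux": {
        "pxe_issues": {"t-linux64-ms-001": "BUG ID HERE"},
        "loaner": {"t-linux64-ms-002": "BUG ID HERE"},
    },
    "windows": {
        "hdd_issues": {"t-w1064-ms-001": "BUG ID HERE"},
    },
}


def machines_to_skip(problems, platform):
    """Map each problem machine on `platform` to its (category, bug id)."""
    skipped = {}
    for category, machines in problems.get(platform, {}).items():
        for name, bug in machines.items():
            skipped[name] = (category, bug)
    return skipped


for name, (category, bug) in machines_to_skip(problem_machines, "linux").items():
    print(f"{name}: {category} ({bug})")
```

With this shape, the reason a machine is hidden and its bug travel together, so the final output can report both without extra variables.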

This change is up for grabs, but will require quite some changes around the script.

[Core] User customization via userSettings.json

A lot of hardcoded values and default variable values can be moved to a user_settings.json file.
This will give users the ability to customize TWC to accept their own PC configuration.

ToDo:

  • Create and generate the structure of user_settings.json
  • Store the iLO password and iLO path in an encrypted format, using the PC's unique ID as the salt/password
  • Ability to change the iLO password, with integrity check.
  • Ability to change the iLO installation path, with integrity check.
  • Ability to change click coordinates per display.
  • Rename configuration.py to run_flags.py to avoid confusion.
  • Ability to reset the whole user_settings.json back to default values.
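A possible starting point for the load, save and reset items, sketched with the standard library only. The keys in `DEFAULT_SETTINGS` are hypothetical since the real structure is still to be decided, and the encryption and integrity-check items are deliberately left out.

```python
import copy
import json
from pathlib import Path

# Hypothetical defaults; the real keys are still to be decided.
DEFAULT_SETTINGS = {
    "ilo_path": "",
    "click_coordinates": {"display_1": [0, 0]},
}


def load_settings(path):
    """Read the settings file, taking any missing top-level key from the defaults."""
    settings = copy.deepcopy(DEFAULT_SETTINGS)
    file = Path(path)
    if file.exists():
        settings.update(json.loads(file.read_text(encoding="utf-8")))
    return settings


def save_settings(path, settings):
    """Write the settings as indented JSON so users can edit them by hand."""
    Path(path).write_text(json.dumps(settings, indent=2), encoding="utf-8")


def reset_settings(path):
    """Overwrite the file with the default values and return them."""
    defaults = copy.deepcopy(DEFAULT_SETTINGS)
    save_settings(path, defaults)
    return defaults
```

Merging the file over a copy of the defaults means a settings file written by an older version still loads after new keys are added.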

[AutoReboot] Implement screen reading to detect connection.

Using pygui.locateOnScreen("screenshot", confidence=1.0) we can detect the iLO app and then look in the top left corner of the screen for the " _ " list that tells us we are successfully connected to it.

This will remove the fixed "wait 5-10 seconds" after we click the connect button, where we just hope to be connected within the specified time. That part of the code isn't very pretty.
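The fixed sleep could be replaced with a polling loop along these lines. This is a sketch: `wait_until` is a hypothetical helper, and `is_connected` stands for any callable that wraps the screen check described above. Depending on the library version, a failed screen match may return None or raise an exception, so that wrapper would need to handle both.

```python
import time


def wait_until(is_connected, timeout=30.0, interval=0.5):
    """Poll `is_connected()` until it is truthy or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if is_connected():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The caller gets False on timeout and can decide whether to retry the connect click or skip the machine, instead of continuing blindly.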

Add some kind of ping function

We could add a function that pings hosts that have been idle for more than 6 hours and returns the results. What's your opinion, would this be useful?
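One way this might look, as a sketch only. It assumes the idle time per host is already available as a dict of hours (a hypothetical input shape) and shells out to the system `ping` command, whose count flag differs between Windows (`-n`) and Linux/macOS (`-c`). What a zero exit code means can vary by platform, so the result should be treated as a hint.

```python
import platform
import subprocess

IDLE_THRESHOLD_HOURS = 6


def idle_hosts(idle_hours_by_host, threshold=IDLE_THRESHOLD_HOURS):
    """Return the hosts idle for more than `threshold` hours, sorted by name."""
    return sorted(h for h, hours in idle_hours_by_host.items() if hours > threshold)


def ping(host, timeout_seconds=5):
    """Send one echo request via the system `ping`; True if it exits with 0."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hung or missing ping binary as "no answer".
        return False
    return result.returncode == 0


def ping_idle_hosts(idle_hours_by_host):
    """Ping every host over the idle threshold and map host -> reachable."""
    return {host: ping(host) for host in idle_hosts(idle_hours_by_host)}
```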

Add some kind of parallel-ssh-like code for yosemite/linux

Would be nice to have an integrated script for running commands (like reboot) on known machines that our current script returned as being problematic.

I have been working on a piece of code to do this, but I got stuck at the point where we have to provide our DUO for logging in.

I've asked Aki for help. It's good to have it here as well, so anybody can come up with ideas.

For now we have the following:

import getpass
import paramiko

ssh = paramiko.SSHClient()

# Ask for the private key passphrase without echoing it to the terminal
passPhrase = getpass.getpass("What's your passphrase for private key? ")

# Accept unknown host keys automatically (acceptable for testing only)
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Locate the private key and unlock it
privkey = paramiko.RSAKey.from_private_key_file('/home/rolandmutter/.ssh/id_rsa', password=passPhrase)

# Main call to connect to the host, which is hardcoded for now for test purposes
ssh.connect('t-yosemite-r7-230.test.releng.mdc2.mozilla.com', username='root', pkey=privkey)

#paramiko.util.log_to_file(os.path.expanduser('/home/rolandmutter/paramiko.log'), logging.DEBUG)

# Command to execute, again hardcoded
stdin, stdout, stderr = ssh.exec_command('ls')

# Print what the command wrote to stdout on the remote machine
print(stdout.readlines())

# Close the connection
ssh.close()

[GSheet] Keep track of the number of actions taken

In the master inventory sheet we have a column, CiDuty CLI # of Actions Taken. This column should show how many times TWC has taken action on a given machine.

For now the only action we can take is an automated reboot, but in the future we are looking at ways to implement automated re-imaging as well.

Workflow:

  1. Get List of Machines
  2. Get current CiDuty CLI # of Actions Taken
  3. Take action
  4. If the action is successful, update int(CiDuty CLI # of Actions Taken) + 1
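The workflow above can be sketched as follows. The sheet is stood in for by an in-memory dict mapping machine name to the current cell value, and `action` is a hypothetical callable that returns True on success. Writing the new value back to the real sheet is not shown.

```python
def record_action(actions_taken, machine, action):
    """Run `action(machine)` and bump the machine's counter only on success.

    `actions_taken` maps machine name to the current cell value, which may
    be an int, a numeric string, or empty.
    """
    current = int(actions_taken.get(machine) or 0)
    if action(machine):
        current += 1
        actions_taken[machine] = current
    return current
```

Reading the current value before acting and writing only after success means a failed reboot leaves the sheet untouched.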

Make the script use the correct DC lists

While trying to modify the script to accept the new machines, I noticed that we currently use the Linux "mdc2_range" for Windows as well, which produces misleading information.

Use case:
print("Total of missing server : {}".format(len(missing_machines) - len(mdc2_range)))

I also tried to correct this, but for some reason the "mdc_range" generated for Windows is not accessible.

Filter Table taking Default values into consideration

This issue is for the GUI branch.

Currently we show all the machines if neither IDLE nor Ignored? is selected.
By default, the filter logic should work as described below, even if nothing (IDLE/Ignored) is selected in the UI:

  1. Is the machine ignored? Yes? Don't show it in the table. No? Show it.
  2. Is the machine IDLE for 6 hours or more? Yes? Show it in the table. No? Don't show it.

Basically, we are missing "default values" that get overridden by the custom input the user sets in the UI.
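A small sketch of that default handling, independent of the GUI toolkit. The function name, the option names and the per-machine keys (`ignored`, `idle_hours`) are hypothetical. Passing None stands for "nothing selected in the UI".

```python
DEFAULT_MIN_IDLE_HOURS = 6
DEFAULT_SHOW_IGNORED = False


def filter_machines(machines, min_idle_hours=None, show_ignored=None):
    """Filter the table rows, using the defaults for any option left unset."""
    if min_idle_hours is None:
        min_idle_hours = DEFAULT_MIN_IDLE_HOURS
    if show_ignored is None:
        show_ignored = DEFAULT_SHOW_IGNORED
    return [
        m for m in machines
        if (show_ignored or not m["ignored"]) and m["idle_hours"] >= min_idle_hours
    ]


# Hypothetical rows as the table might hold them.
rows = [
    {"name": "t-linux64-ms-001", "ignored": False, "idle_hours": 8},
    {"name": "t-linux64-ms-002", "ignored": True, "idle_hours": 12},
    {"name": "t-linux64-ms-003", "ignored": False, "idle_hours": 2},
]
print([m["name"] for m in filter_machines(rows)])
```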
