azure-li-services's Issues

Need to disable EDAC (error detection/correction) on both LI and VLI systems

This had come up in a lunch conversation with our HPE vendor, and I had it on my list to look into. Then, as it turns out, it came up again during meetings with Operations on compute setup:

We need to disable EDAC in the O/S for both LI and VLI as the underlying hardware handles this. This is done by:

modprobe -r sb_edac edac_core
vim /etc/modprobe.d/blacklist.conf and add below lines:
  blacklist sb_edac
  blacklist edac_core

Can this capability be added? Would this need a YAML extension, or could you just handle it automatically, since it's done for both LI and VLI?
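
If a YAML switch is preferred, a hypothetical extension could look like the following (the key name and values are assumptions, not part of the current specification); otherwise the blacklist steps above could simply be applied unconditionally for LI and VLI:

edac: disabled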

Design of the package installer service

There is a packages section in the YAML config, as follows:

packages:
  directory: "/mnt/directory-with-rpm-files"
  directory: "/mnt/another-directory-with-rpm-files"

I'd like to start a discussion on how those packages should be installed.

Basically there are two options:

  1. Take all packages from the listed directories and install them the brute force way by calling rpm

  2. Create a clean repository from the packages using the createrepo tool, register that repo with the zypper package manager, and install all packages in a transaction with the zypper tool

I vote to do it "right" with option 2, for the following reasons:

  • A package install transaction through zypper checks for consistency and dependencies and allows all package dependencies to be resolved in a clean way. Of course, there is also the potential that this fails, resulting in the package not being installable

  • A package installation via zypper takes all other configured repositories into account and helps to keep the system integrity intact

  • A package installation via zypper provides good error messages about dependency problems, conflicts, and other issues

The brute force way ignores all that and blindly installs data on the machine, which I personally would not do. If there are no objections, I'll implement option 2.
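
For illustration, a minimal sketch of option 2, using the directory from the example above (the repository alias is an arbitrary assumption):

createrepo /mnt/directory-with-rpm-files
zypper addrepo --no-gpgcheck dir:///mnt/directory-with-rpm-files azure-li-packages
zypper --non-interactive install --from azure-li-packages <package names>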

Thoughts?

No credentials: tag should result in an error

From PR #27

Regarding credentials, we need another issue to raise an error if no credentials are specified. Could you please open that one? Thanks

If no credentials tag exists in the YAML, that should be an error, since that would mean that you can't log into the machine in the end.
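
A minimal sketch of such a check, assuming the parsed YAML is available as a plain dict (read_config and the exception name are illustrative placeholders, not the project's actual API):

config = read_config()
if not config.get('credentials'):
    raise AzureHostedConfigFileException('No credentials section found in configuration')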

Need to be able to specify GID for group creation

Early on, we made the decision to allow SUSE to choose the GID when creating groups.

It turns out that this won't work. We need to know the UID AND GID (not name) in order to set up the storage subsystem (it works from UID and GID, not names). So having the SUSE system choose the GID won't work, since we won't be able to set up storage. Storage must be set up in advance to be able to deploy the O/S.

I had asked for GID originally and, as our operations team was looking at what we provided, they pointed out this restriction. I was previously unaware of this (sorry).

This is a bit of a show-stopper for us, as we can't do a complete UCS (HLI) deployment, which is the next step in testing. I can download the existing image to do basic testing (make sure prior issues that were resolved were indeed fixed), but in terms of the next main test pass, we'll need this.
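
A hypothetical extension of a credentials entry could carry the GID explicitly (the key names below are assumptions, not part of the current specification):

credentials:
 -
  username: "hanauser"
  uid: 1002
  group:
    name: "sapsys"
    gid: 1001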

Service azure-li-storage is failing

I tested azure-li-storage, and it's failing. I believe there's a dependency problem, and the network needs to be up in order to mount NFS shares. Here's the relevant output:

linux:~ # systemctl -l status azure-li-storage
* azure-li-storage.service - Setup of Azure Li/VLi Storage Mountpoints
   Loaded: loaded (/usr/lib/systemd/system/azure-li-storage.service; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-06-27 18:04:46 UTC; 4min 8s ago
 Main PID: 26775 (code=exited, status=1/FAILURE)

Jun 27 18:04:46 linux azure-li-storage[26775]: azure_li_services.exceptions.AzureHostedCommandException: mount: stderr: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: , stdout: (no output on stdout)
Jun 27 18:04:46 linux systemd[1]: azure-li-storage.service: Main process exited, code=exited, status=1/FAILURE
Jun 27 18:04:46 linux systemd[1]: Failed to start Setup of Azure Li/VLi Storage Mountpoints.
Jun 27 18:04:46 linux systemd[1]: azure-li-storage.service: Unit entered failed state.
Jun 27 18:04:46 linux systemd[1]: azure-li-storage.service: Failed with result 'exit-code'.
Warning: azure-li-storage.service changed on disk. Run 'systemctl daemon-reload' to reload units.
linux:~ #

Can you fix this issue please?
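
For reference, the usual way to express this ordering in a systemd unit would be something along these lines (a sketch, not the project's actual unit file):

[Unit]
Description=Setup of Azure Li/VLi Storage Mountpoints
Wants=network-online.target
After=network-online.target azure-li-network.service

That way the storage service only starts once the network is reported as up and the azure-li-network service has run.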

DHCP isn't available in our bare-metal data centers

It turns out that DHCP servers are not available in our bare-metal data centers, and the latest YAML specification assumed that DHCP was available.

I suggest updating the YAML specification as follows:

REMOVE

nics: 2

ADD

  networking:
    -
      interface: eth0
      description: Client vlan-10
      ip: 10.260.10.51
      gateway: 10.250.10.1
      subnet mask: 255.255.255.0
    -
      interface: eth1
      description: NFS vlan-51
      ip: 10.260.51.51
      subnet mask: 255.255.255.0
    -
      interface: eth2
      description: B2B vlan-52
      ip: 10.260.52.51
      subnet mask: 255.255.255.0

In this particular case, you can infer that there are three NICs. Description is for Microsoft purposes only, and can be ignored by azure-li-services. A few things to note:

  1. If no interface is defined, can you just pick any available interface (not otherwise specified elsewhere in the YAML specification)?
  2. The gateway should be optional, not mandatory (depending on how that interface is used).
  3. We might want to leave the option open for DHCP in the future. Perhaps add something like: dhcp: true or dhcp: false. If DHCP is false, then it should follow with IP/subnet mask[/gateway]. This is purely optional.

The latest complete YAML specification, with this proposed change, can be found here.
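
Assuming the network service writes standard SUSE ifcfg files, the eth0 entry above would translate to roughly the following (values copied from the example):

# /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='static'
STARTMODE='auto'
IPADDR='10.260.10.51'
NETMASK='255.255.255.0'

# /etc/sysconfig/network/ifroute-eth0
default 10.250.10.1 - eth0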

Service azure-li-user fails if storing a private key

I had problems storing a private key due to an unbound variable (resolved in issue #70), but with the latest build (1.0.36-Build2.2), a different problem surfaces:

azurehost:~ # systemctl status -l azure-li-user
* azure-li-user.service - Setup of Azure Li/VLi workload user
   Loaded: loaded (/usr/lib/systemd/system/azure-li-user.service; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-08-20 17:23:21 UTC; 22min ago
 Main PID: 37684 (code=exited, status=1/FAILURE)

Aug 20 17:23:21 linux azure-li-user[37684]: Traceback (most recent call last):
Aug 20 17:23:21 linux azure-li-user[37684]:   File "/usr/bin/azure-li-user", line 11, in <module>
Aug 20 17:23:21 linux azure-li-user[37684]:     load_entry_point('azure-li-services==1.1.6', 'console_scripts', 'azure-li-user')()
Aug 20 17:23:21 linux azure-li-user[37684]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 74, in main
Aug 20 17:23:21 linux azure-li-user[37684]:     raise AzureHostedException(user_setup_errors)
Aug 20 17:23:21 linux azure-li-user[37684]: azure_li_services.exceptions.AzureHostedException: [AzureHostedConfigFileSourceMountException('Source mount failed with: primary:mount: /dev/mapper/3600a09803830362f6e2b48516b525761-part1 is already mounted or /mnt busy\n       /dev/mapper/3600a09803830362f6e2b48516b525761-part1 is already mounted on /mnt\n, fallbackmount: special device /dev/dvd does not exist\n',)]
Aug 20 17:23:21 linux systemd[1]: azure-li-user.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 17:23:21 linux systemd[1]: Failed to start Setup of Azure Li/VLi workload user.
Aug 20 17:23:21 linux systemd[1]: azure-li-user.service: Unit entered failed state.
Aug 20 17:23:21 linux systemd[1]: azure-li-user.service: Failed with result 'exit-code'.
Warning: azure-li-user.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #

Looks like the code is having an issue reaching out to get the private key (although similar code in script execution, for example, seems to work).

Please advise and let me know, thanks.

Need capability to select, via YAML, if crash dump should be enabled

On VLI systems with 12+TB of RAM, crash dumps can take quite a while.

Is it possible, within the YAML file, to select if crash dumps are enabled or disabled? Perhaps something like:

crash-dump: enabled | disabled

If not specified in the YAML, then crash dump behavior should stay as it is today (enabled).
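
Assuming the standard SLES kdump service, honoring such a flag would amount to something like:

# crash-dump: disabled
systemctl disable --now kdump

# crash-dump: enabled (or not specified)
systemctl enable --now kdump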

saptune sequence is wrong

A conversation with Peter Schinagl revealed that the sequence we use to start the saptune service is wrong. The suggested fix for the implementation is as follows:

  • systemctl enable tuned
  • systemctl start tuned
  • saptune daemon start
  • saptune solution apply HANA

There is no need to set any tuned profile; that is done by saptune itself.

I'm marking this as a bug unless there are objections to changing this as suggested.

Error handling on service failure

Description

Within the scope of the Azure LI services, there is the potential that something fails or does not match the machine requirements. In this case, a procedure to inform the user, along with the ability to access the system, is desired.

LI Testing: kmod-usnic_verbs is missing from installed packages

Note the following output:

soltyo32:~ # rpm -qa | grep -i kmod
kmod-compat-17-9.6.1.x86_64
libkmod2-17-9.6.1.x86_64
* kmod-usnic_verbs-1.1.545.8.sles12sp3-1.x86_64
kmod-17-9.6.1.x86_64

The line marked with "*" is missing on the test image, and we normally have this (and check for it). What is this, and should we have it?

Take YAML file from DVD as fallback source

We're still having problems mounting LUNs reliably from our Linux blade (will follow up with that offline after additional testing).

However, I've since learned that Azure data centers often use DVD drives to get data onto these systems. As we're moving more towards automation of our infrastructure, it's becoming easier for us to make a DVD visible to a blade upon boot.

Would it be possible to look for a LUN (as existing code does) for the YAML and, if the LUN doesn't exist, then look at the local DVD drive on the system? That gives us an option and might be easier for us in the future.

Please let me know, thanks.

Don't forget to remove the password for root

Currently the image allows root login on the console via an insecure password.

This issue exists so we don't forget to remove the root password once the image is handed over to customers. Also delete the pre-seeded authorized_keys setup from the overlay files archive:

  • Devel:PubCloud:Azure:ImagesSLE12/SLES12-SP3-SAP-Azure-LI-BYOS : config.sh
  • Devel:PubCloud:Azure:ImagesSLE12/SLES12-SP3-SAP-Azure-LI-BYOS : root.tar.gz

Problem with cleanup service with image 1.0.35

Back from (yet another) vacation, testing image 1.0.35. Sorry for the delay.

Upon booting the blade with image 1.0.35, no errors were shown on the console. Yet, when I took a look at the services, I observed this:

azurehost:~ # systemctl -a | grep azure-li
  azure-li-call.service                                                                                                loaded    inactive dead      Setup of Azure Li/VLi Script Caller
* azure-li-cleanup.service                                                                                             loaded    failed   failed    Cleanup/Uninstall Azure Li/VLi services
  azure-li-config-lookup.service                                                                                       loaded    inactive dead      Lookup and import Azure Li/VLi config file
  azure-li-install.service                                                                                             loaded    inactive dead      Installation of custom Azure Li/VLi addon packages
  azure-li-machine-constraints.service                                                                                 loaded    inactive dead      Validation of Azure Li/VLi machine constraints
  azure-li-network.service                                                                                             loaded    inactive dead      Setup of Azure Li/VLi network configuration
  azure-li-report.service                                                                                              loaded    inactive dead      Report status of Azure Li/VLi services
  azure-li-storage.service                                                                                             loaded    inactive dead      Setup of Azure Li/VLi Storage Mountpoints
  azure-li-system-setup.service                                                                                        loaded    inactive dead      System Setup of Azure Li/VLi machine
  azure-li-user.service                                                                                                loaded    inactive dead      Setup of Azure Li/VLi workload user
azurehost:~ # systemctl -l status azure-li-cleanup
* azure-li-cleanup.service - Cleanup/Uninstall Azure Li/VLi services
   Loaded: loaded (/usr/lib/systemd/system/azure-li-cleanup.service; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2018-08-14 17:20:37 UTC; 6min ago
  Process: 53465 ExecStart=/usr/bin/azure-li-cleanup (code=exited, status=1/FAILURE)
 Main PID: 53465 (code=exited, status=1/FAILURE)

Aug 14 17:20:32 azurehost systemd[1]: Starting Cleanup/Uninstall Azure Li/VLi services...
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: Traceback (most recent call last):
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]:   File "/usr/bin/azure-li-cleanup", line 11, in <module>
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/cleanup.py", line 38, in main
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]:   File "/usr/lib/python3.4/site-packages/azure_li_services/defaults.py", line 46, in get_service_reports
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: ImportError: No module named 'azure_li_services.status_report'
Aug 14 17:20:37 azurehost systemd[1]: azure-li-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Aug 14 17:20:37 azurehost systemd[1]: Failed to start Cleanup/Uninstall Azure Li/VLi services.
Aug 14 17:20:37 azurehost systemd[1]: azure-li-cleanup.service: Unit entered failed state.
Aug 14 17:20:37 azurehost systemd[1]: azure-li-cleanup.service: Failed with result 'exit-code'.
Warning: azure-li-cleanup.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #

It looks like the cleanup service had some sort of a problem referencing a module. Thoughts?

LI Testing: No separate /boot partition

This isn't necessarily a problem, but a discussion topic.

We've noted that our SP3 image does not create a separate /boot partition. Now, in yesteryear, it was common to have a separate /boot partition, partly for "safety", partly to ensure there was always a bit of free space there "just in case", etc. In the past, we've worked with systems that create this.

What's common practice for a separate /boot partition these days? Does SAP/HANA give specific guidance in this area?

Service azure-li-storage signals error when none exists

Due to problems with azure-li-machine-constraints, my /etc/issue file has:

linux:~ # cat /etc/issue

!!! DEPLOYMENT ERROR !!!
For details see: "systemctl status azure-li-machine-constraints azure-li-storage"

linux:~ #

Looking at azure-li-storage, I see no problems (indeed, my YAML doesn't configure that yet - I'm testing that next):

linux:~ # systemctl status azure-li-storage
* azure-li-storage.service - Setup of Azure Li/VLi Storage Mountpoints
   Loaded: loaded (/usr/lib/systemd/system/azure-li-storage.service; bad; vendor preset: disabled)
   Active: inactive (dead) since Tue 2018-06-19 15:27:25 UTC; 29min ago
 Main PID: 26652 (code=exited, status=0/SUCCESS)

Jun 19 15:27:25 linux systemd[1]: Starting Setup of Azure Li/VLi Storage Mountpoints...
Jun 19 15:27:25 linux systemd[1]: Started Setup of Azure Li/VLi Storage Mountpoints.
Warning: azure-li-storage.service changed on disk. Run 'systemctl daemon-reload' to reload units.
linux:~ #

Why is this mentioned in /etc/issue when it's not actually a problem?

Constraint checker service

Add support for handling the machine_constraints section

machine_constraints:
  min_cores: 32
  min_memory: "20tb"
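
A minimal sketch of such a check in Python, reading the values straight from the running system (the function and message wording are illustrative, not the project's actual code):

import os

def check_machine_constraints(min_cores, min_memory_bytes):
    errors = []
    cores = os.cpu_count()
    memory_bytes = 0
    with open('/proc/meminfo') as meminfo:
        for line in meminfo:
            if line.startswith('MemTotal:'):
                # MemTotal is reported in kB
                memory_bytes = int(line.split()[1]) * 1024
                break
    if cores < min_cores:
        errors.append('Core count {0} is below required minimum {1}'.format(cores, min_cores))
    if memory_bytes < min_memory_bytes:
        errors.append('Main memory {0} bytes is below required minimum {1}'.format(memory_bytes, min_memory_bytes))
    return errors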

VLI: ISO files vs. IMG files (Unable to boot ISO file as a USB Image)

I'm looking at our VLI systems and how we'll be booting for legacy (Gen3) systems.

I am NOT able to boot an ISO file as a USB image. For virtual devices, I have three choices:

  1. Floppy Image
  2. ISO Image
  3. HD/USB Image (under section "Hard disk/USB Key")

If I try to select an .iso file for USB Key, I get an error: "Invalid image file. Cannot redirect Hard disk/USB Key Image."

That error does NOT occur if I mount the .ISO file as an ISO image.

I'm unfamiliar with these imaging styles. How is a .IMG file different from a .ISO file? Can I convert from one to the other? Or can you provide a .IMG file that I can try to work with?

Thanks in advance.

Disk/NFS Mounting Service

Discussions with the OPS team confirmed that we currently mount a number of NFS volumes. So the disk mounting service must support NFS, and it must support the special options that you'd specify on the command line.

I've asked our operations team for output from /etc/fstab, and will post that here once I get it. That should allow us to fully understand the requirements.
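
For reference, a typical entry of the kind we mount today looks roughly like this in /etc/fstab (the server, export path, and options here are placeholders until I get the real output):

10.250.51.10:/hana_data  /hana/data  nfs  rw,vers=4,hard,timeo=600,rsize=1048576,wsize=1048576  0 0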

Service azure-li-user fails if storing a private key

If I'm creating a private key on an alternate user (say hanatest), with the addition of a line like:

  ssh-private-key: "ssh/id_dsa_netapp"

to a user in the credentials section, then I get an error upon execution of that service at boot-up:

azurehost:~ # systemctl status -l azure-li-user
* azure-li-user.service - Setup of Azure Li/VLi workload user
   Loaded: loaded (/usr/lib/systemd/system/azure-li-user.service; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-08-15 15:51:38 UTC; 3min 33s ago
 Main PID: 37503 (code=exited, status=1/FAILURE)

Aug 15 15:51:38 linux azure-li-user[37503]:     load_entry_point('azure-li-services==1.1.5', 'console_scripts', 'azure-li-user')()
Aug 15 15:51:38 linux azure-li-user[37503]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 52, in main
Aug 15 15:51:38 linux azure-li-user[37503]:     setup_ssh_authorization(user)
Aug 15 15:51:38 linux azure-li-user[37503]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 143, in setup_ssh_authorization
Aug 15 15:51:38 linux azure-li-user[37503]:     Command.run(['umount', ssh_key_source.location])
Aug 15 15:51:38 linux azure-li-user[37503]: UnboundLocalError: local variable 'ssh_key_source' referenced before assignment
Aug 15 15:51:38 linux systemd[1]: azure-li-user.service: Main process exited, code=exited, status=1/FAILURE
Aug 15 15:51:38 linux systemd[1]: Failed to start Setup of Azure Li/VLi workload user.
Aug 15 15:51:38 linux systemd[1]: azure-li-user.service: Unit entered failed state.
Aug 15 15:51:38 linux systemd[1]: azure-li-user.service: Failed with result 'exit-code'.
Warning: azure-li-user.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #

If you can take a look at this, that would be great, thanks.

LI Testing: Cisco fnic drivers appear to be missing (enic drivers are installed)

In the past, we've always had both Cisco enic and Cisco fnic drivers installed on LI systems. For some reason, this image doesn't appear to have that:

soltyo32:~ # rpm -qa | grep -i cisco
cisco-enic-usnic-kmp-default-3.0.44.553.545.8_k4.4.73_5-1.x86_64
* cisco-fnic-kmp-default-1.6.0.36_k4.4.73_5-1.x86_64

(Line marked with "*" does not exist on the test image)

Only enic is showing up.

Note: When we do "modinfo fnic", the driver does show up (but version 1.6.0.34 instead of 1.6.0.36).
In the past, we've had RPMs for both enic and fnic.

This may be a Cisco issue (I'm not sure). If so, let me know and I'll involve them.

SSH configuration for access to storage

Final issue that came up in meetings with Operations:

We need to configure SSH for access to storage. This is done by SSH public/private key configuration:

  • SSH key is generated from O/S
    • Private key stored in /root/.ssh/id_dsa
    • Public key is added to storage

This is currently done with a command like ssh-keygen -t dsa -b xxx.

It would be a pain to get the public key off of the system (after generating the password hashes, we don't save the source password(s), if any). So I propose we generate the key elsewhere and then add the private key to the YAML file to be stored in /root/.ssh/id_dsa. Perhaps, in the credentials section, add:

username: root
...
private_key:
  name: /root/.ssh/id_dsa
  ssh-key: "XXXXX"

Thoughts?

VLI: Systems may not reboot without operator intervention

In certain cases, our VLI systems may not reboot automatically without operator intervention. This is a problem if a customer tries to reboot a system and would then need to contact operations to get the system back up.

In chatting with HPE about this, one of their engineers said:

Autorebooting - startup.nsh is just one option. HPE has no opinion on a preferred method of autobooting. So your choice

startup.nsh has a decided advantage of surviving both BIOS updates and ‘power -c reset’. The latter is a useful tool for clearing up some ‘logical’ problems and is also required after a BIOS update. So, in my opinion, startup.nsh is preferred over setting a BIOS boot option through the BIOS menu.

So while HPE itself has no opinion on setting up autoboot, their engineers like it, and so do I. We do this as follows:

echo "fs0:\efi\sles_sap\grubx64.efi" > /boot/efi/startup.nsh

It's weird that the file has \ characters in it, but I imagine you folks understand this better than I do. We'd like this to be set up automatically for VLI platforms.

NOTE: This is specific to VLI - Very Large Instance - platforms!

system_setup: energy performance settings

The recommended energy performance settings for SAP HANA instances are:

# CPU Frequency/Voltage scaling
cpupower frequency-set -g performance

# low latency/maximum performance
cpupower set -b 0

Add this to /etc/init.d/boot.local for persistence. The image has the rc-local service activated, which causes that file to be read at boot.

LI Testing: enic driver version appears to be old?

Our image has enic driver version 2.3.0.31. In the past, we've always operated with 2.3.0.44.

Has Cisco not submitted their latest driver to SUSE for inclusion into the driver database, or is something else going on? Thanks.

system_setup: kernel samepage merging

In the scope of SAP HANA, ksmd should not run:

#  stop ksmd from running but keep merged pages,
echo 0 > /sys/kernel/mm/ksm/run

Add this to /etc/init.d/boot.local for persistence. The image has the rc-local service activated, which causes that file to be read at boot.

LI Testing: Missing SuSEfirewall2 configuration

We generally configure the SUSE firewall as follows:

# cat /etc/sysconfig/SuSEfirewall2 | grep 112
FW_SERVICES_EXT_TCP="22 53 3128 1128"
FW_SERVICES_EXT_UDP="53 1129"

I can get you more data (like the entire file) if needed, just let me know.

What's recommended practice these days for SUSE firewall configuration?

LI Testing: Missing Packages

In the SUSE Linux Enterprise Server 12.x for SAP Applications Configuration Guide for SAP HANA, Section 8.1, there is a list of packages that should be installed. We are missing some of these.

While this list is not exhaustive, we are missing at least:

  • iptraf
  • findutils-locate
  • audit-libs
  • keyutils-libs
  • perl-Time-Piece (needed for backup scripts)
  • syslog-ng

I can go through and make a more exhaustive list if needed. But based on our docs, we need those packages for support or proper operation.
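
For reference, installing just the packages listed above by hand would be along these lines (assuming they're all available in the configured repositories):

zypper --non-interactive install iptraf findutils-locate audit-libs keyutils-libs perl-Time-Piece syslog-ng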

Can you please review section 8.1 and let me know your thoughts, thanks.

Mounting of LUN with Azure Li config file

Would it be possible for the filesystem which contains the config file to be created with a persistent label name we can rely on? That would allow us to do a simple label-based mount without having to search for the device. For example:

mount --label azconfig /mount_point

Depending on which filesystem you choose, the volume label has a size limitation, usually 16 bytes.
I'm open to any name which works for you.
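
For example, with ext4 (label limit 16 characters) the label can be set at filesystem creation time or added afterwards (the device name is a placeholder):

mkfs.ext4 -L azconfig /dev/sdX1
e2label /dev/sdX1 azconfig
mount LABEL=azconfig /mount_point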

Thanks for your feedback

Service azure-li-user fails in 1.0.38-Build2.2

It seems like the azure-li-user service is a bear to work out; it's been problematic in the past three builds.

In the latest build, the service still fails with the following error:

azurehost:~ # systemctl status -l azure-li-user
* azure-li-user.service - Setup of Azure Li/VLi workload user
   Loaded: loaded (/usr/lib/systemd/system/azure-li-user.service; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-08-22 18:02:10 UTC; 1min 56s ago
 Main PID: 37598 (code=exited, status=1/FAILURE)

Aug 22 18:02:10 linux azure-li-user[37598]: Traceback (most recent call last):
Aug 22 18:02:10 linux azure-li-user[37598]:   File "/usr/bin/azure-li-user", line 11, in <module>
Aug 22 18:02:10 linux azure-li-user[37598]:     load_entry_point('azure-li-services==1.1.7', 'console_scripts', 'azure-li-user')()
Aug 22 18:02:10 linux azure-li-user[37598]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 74, in main
Aug 22 18:02:10 linux azure-li-user[37598]:     raise AzureHostedException(user_setup_errors)
Aug 22 18:02:10 linux azure-li-user[37598]: azure_li_services.exceptions.AzureHostedException: [AzureHostedCommandException('umount: stderr: umount: /mnt: target is busy\n        (In some cases useful info about processes that\n         use the device is found by lsof(8) or fuser(1).)\n, stdout: (no output on stdout)',)]
Aug 22 18:02:10 linux systemd[1]: azure-li-user.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 18:02:10 linux systemd[1]: Failed to start Setup of Azure Li/VLi workload user.
Aug 22 18:02:10 linux systemd[1]: azure-li-user.service: Unit entered failed state.
Aug 22 18:02:10 linux systemd[1]: azure-li-user.service: Failed with result 'exit-code'.
Warning: azure-li-user.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #

Seems like it's trying to unmount the YAML LUN. That's going to be problematic; I guess to unmount you'd need to make sure all other services are done, and that there are no timing/race conditions. Perhaps it's safest to handle this in the cleanup service, which runs after everything else is completed?

I'm still trying to get a viable build for PM testing. Once this is fixed, please generate a new .xz file for me to retest. Thanks.

Math problems in azure-li-machine-constraints

On the latest test image (v1.0.34), I encountered:

azurehost:~ # systemctl status -l azure-li-machine-constraints
* azure-li-machine-constraints.service - Validation of Azure Li/VLi machine constraints
   Loaded: loaded (/usr/lib/systemd/system/azure-li-machine-constraints.service; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2018-07-19 21:31:08 UTC; 57min ago
 Main PID: 37535 (code=exited, status=1/FAILURE)

Jul 19 21:31:08 linux azure-li-machine-constraints[37535]:     load_entry_point('azure-li-services==1.1.4', 'console_scripts', 'azure-li-machine-constraints')()
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/machine_constraints.py", line 45, in main
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]:     machine_constraints
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]:   File "/usr/lib/python3.4/site-packages/azure_li_services/units/machine_constraints.py", line 75, in check_main_memory_validates_constraint
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]:     min_bytes, binary=True
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: azure_li_services.exceptions.AzureHostedMachineConstraintException: Main memory: 1.48 TiB is below required minimum: 1.5 TiB
Jul 19 21:31:08 linux systemd[1]: azure-li-machine-constraints.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 21:31:08 linux systemd[1]: Failed to start Validation of Azure Li/VLi machine constraints.
Jul 19 21:31:08 linux systemd[1]: azure-li-machine-constraints.service: Unit entered failed state.
Jul 19 21:31:08 linux systemd[1]: azure-li-machine-constraints.service: Failed with result 'exit-code'.
Warning: azure-li-machine-constraints.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #

The important part: Main memory: 1.48 TiB is below required minimum: 1.5 TiB.

Very, very close. The BIOS states that I have 1.5 TB, but obviously the O/S sees ever so slightly less.

We could compensate for this by reducing the memory size we specify in the constraint to (memory size - 0.05 TB). Or maybe the machine constraints check should have some "noise cancellation" tolerance for small differences?
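
As a sketch, such a tolerance could be applied directly in the comparison (the 2% figure is an arbitrary assumption; the variable names and exception follow the traceback above):

tolerance = 0.02
if main_memory_bytes < min_bytes * (1 - tolerance):
    raise AzureHostedMachineConstraintException(
        'Main memory is below the required minimum (beyond tolerance)'
    )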

Services abort upon error, not performing followup operations

Due to the diagnosis of Issue #70, I learned that some of our services (clearly the user service, perhaps the network service and other services as well) abort upon error, and don't perform further processing.

This was particularly exposed when creation of a user resulted in an error, and other users independent to that were not created either, preventing me from logging into the machine.

I think it would be useful to continue processing on error. BUT: We still want to capture the error that occurred and be able to report on it (via systemctl or some other mechanism). If for no other reason, this prevents the cycle of "try this YAML", fix one thing, "try this new YAML", only to learn that there may be a series of problems with it. "Trying a YAML" takes quite a bit of time due to the boot times of these very large systems (memory tests, at least on VLI systems, can't be disabled, resulting in 20+ minute POST times).

I think it's super important that, when an error occurs, the failure provides enough information to diagnose the problem. That's probably top priority. But if we can retain that and perform further YAML processing, I think that would be useful.

Knowing this is the case, I can try to organize YAML to reduce this problem. But that's just a reduction of the problem - it's still a problem. And in some cases (like the network service), I really do want to mount all four or five NFS shares, in case problems exist with further mounts (if an error occurs early on).
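
A sketch of the collect-and-continue pattern I have in mind (create_user and users are illustrative names; user_setup_errors and the exception appear in the tracebacks above):

user_setup_errors = []
for user in users:
    try:
        create_user(user)
    except Exception as issue:
        user_setup_errors.append(issue)
if user_setup_errors:
    raise AzureHostedException(user_setup_errors)

That way every item is attempted, and the combined failure is still visible via systemctl.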

system_setup: add kernel crash dump configuration

We had a recent customer issue where the O/S crashed due to a panic. Unfortunately, the machine was not configured for kernel dumps, so we were unable to capture what happened, much less why.

Are SAP Hana images configured for kernel dumps (dump level 31) by default? If not, can we do that?
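
On SLES the dump level is normally controlled through /etc/sysconfig/kdump; a sketch of a dump-level-31 setup would be:

# /etc/sysconfig/kdump
KDUMP_DUMPLEVEL="31"

systemctl enable --now kdump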

Need to set hostname of the system

I've finally been able to meet with the operations team and discuss compute setup. This will generate a number of new issues. Sorry it took so long; it's hard to get meetings with the operations team, and we discussed a long list of topics (storage setup, network setup, compute setup, etc.).

First on the list: We need a way to set the hostname of the system.

It does not appear necessary, based on my information, to set a domain name or to otherwise configure the /etc/hosts file.
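
A hypothetical YAML addition (the key name is an assumption), together with the command the service would effectively run:

hostname: "soltyo32"

hostnamectl set-hostname soltyo32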

Clean-up of system after System Configuration

There is currently an E-Mail discussion on handling of the YAML file upon completion of system configuration. I thought it would be best to continue that here.

In addition, after configuration, some run-once services from azure-li-services are left around as well. While they did offer some debugging capabilities, I think such debugging could be offered by log files as well. At first glance, I don't like the idea of run-once services being left around. What if an administrator inadvertently runs one? Would any of them do damage to an already configured system?

Things to consider cleanup of include:

  • The YAML file, currently copied to /etc after it's found, is left around. Should it be deleted instead? (I think yes, but it's a discussion issue.)

  • Service scripts (azure-li-services, azure-li-services-network, and more to follow) are left around. Should they be removed as part of configuration cleanup?

  • Other artifacts of system configuration (other than log files) that may be in place.

I think any artifacts of the system configuration should be carefully considered. I like log files left around for diagnosis of problems, but I propose that all other configuration artifacts be removed during configuration cleanup.

Credentials in the YAML file

The YAML file currently has this segment for credentials:

credentials:
  username: "hanauser"
  password: "somerandompassword" (should this be base-64 encoded to allow for special characters?)
  ssh-key: "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCt0tiRjcZE0XU38M6aBxaRd+fsVUnezP8/fStL4vgxuzu84FFTTK6XIVsMWVbDxROVA0H98Sexfm8ZDgxYxQ5g6Nk1wbWxB9wbPHrykJScV3tUGWh4RgDq4rm/8+k7Msd32Ubj2t5962W1h1s+zhax4io0bSvJtKj9Ke1Bcq86zEcFg3iPIyxSY8oKK02luT3m73/0L+2BzVXzSsUwE23Nh37+jVAWt+aXHJPlg7YNJGN/9luqx90I3BiPDRtsj4AZ8OtYvBgpDbGK8cVMEDGvOg0+a8JE8oebzgzArYNlWskCRvdvVPU3rk8GKjSKDC2j5lS5sjiiT342Qc3VjoScg5pNfT5fBQ2KI3+LQW9UGwtQwX4er+rjVMM72f1BRI5pIBuwWQIQJzwlqF/KUZcCC5Ly8H8S96thxpmsnXz89GOfldFbQQa/IP5dqCERFpncg1stnDjtGUGU7LIBcBkxqC0ixHgsNKlqFkpT/2oP3y8J875sxykKcigq2Gzw8avXH0W/7nA2pGIrqMMNvMs4ORmSi0FufwNnMsmJYW/baOPUgO+yzUeWLL9wajFo7ozoXz/Q1mLhd8791XO8+ph1Txs0smqV0BK7l/8rAOfhJB1t/aLIB1D+AuVxkfFXUDCcTT4LHSpS7LgI+vsegcN5rPPdFucgMoxLncEC5Fwfww== Jeff Coffler Common Key"

Anand wasn't comfortable with the raw password being in this file. His concern is that if the file is forwarded around in E-Mail, we're forwarding around credentials, which would be unfortunate.

I propose a new field in the credentials section: password_type. It would have three possible values:

  1. plaintext: Plaintext passwords. Special characters in password (like " character) might cause problems,
  2. base64: Password is Base 64 encoded, and should be decoded,
  3. external: In this case, password is in an external file on the same volume as the YAML file. The password field would contain the name of the file (relative to /etc since that's where the YAML file is).

So, for an example with plaintext:

credentials:
  username: "hanauser"
  password: "somerandompassword"
  password_type: plaintext

An example with base64:

credentials:
  username: "hanauser"
  password: "c29tZXJhbmRvbXBhc3N3b3Jk"
  password_type: base64

Finally, an example with external:

credentials:
  username: "hanauser"
  password: "password_file"
  password_type: external

In this case, since password_file isn't an absolute path, we'd look for /etc/password_file, and we would expect to find something like:

password: "c29tZXJhbmRvbXBhc3N3b3Jk"

It would be assumed that any password_type of type external would always be stored in Base 64 format.
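
For reference, encoding and decoding with standard tools (the value matches the base64 example above):

echo -n "somerandompassword" | base64
c29tZXJhbmRvbXBhc3N3b3Jk

echo "c29tZXJhbmRvbXBhc3N3b3Jk" | base64 -d
somerandompassword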

Is it possible to do this?

kexec reboot for activation of crash memory range

If the crash kernel memory range was adapted such that the currently active range differs from the new configuration, a reboot of the system (a reload of the kernel) is required.

The calculation of the crash memory range is based on the output of kdump calibrate:

  • Use the low value from the kdump calibration
  • Use (RECOMMENDATION * RAM_IN_TB) for the high value if RAM_IN_TB is > 1

If the values differ from the current crash setup, we kexec reboot the system as the last action of the cleanup service.

Machine constraints not checked properly in image v1.0.30

Given machine constraints like:

machine_constraints:
  cpu: 73
  memory: "1.5tb"

I expected an error:

linux:~ # grep processor /proc/cpuinfo | wc -l
72
linux:~ #

No error at the login prompt on the console. Should I be looking for errors elsewhere?

linux:~ # cat /var/lib/azure_li_services/machine_constraints.report.yaml
machine_constraints:
  success: true
linux:~ #

Cleanup of the YAML file

The YAML file, when I started, was adopted from others. As it's being developed, I note that there are fields in there that serve no purpose.

  1. version: date field. We have no versioning, but the schema should allow for that. Do we want a date field, or a number here (like version: 1)? We should go on ignoring it for now, but let the field remain in case we need it at a future time.

  2. blade: entry (with fields under it). I'm not sure that this serves a purpose at all, but removal will result in code changes for both of us. Remove?

  3. sku: is a placeholder field for us. azure-li-services will ignore it, but we'll continue to specify that.

  4. cpu: is ignored right now. This could serve as a minimum (much like min_size under storage), but seems redundant given that sku has that. Remove?

  5. memory is ignored right now. Same as cpu, remove?

  6. time_server. Our colos don't have a time server. If one was set up, that would be customer specific, and wouldn't be something that would be generally offered (so customer would need to set that up). Remove?

Since all of this predates me, I'm going to reach out to my team and get feedback on these issues. I don't want to remove something that might be important for some reason I'm unaware of.

User creation is not correct in image v1.0.30 (.ssh permissions)

Multiple bugs here, but all related to user creation:

Given a YAML file containing this segment:

credentials:
 -
  username: "hanauser"
  shadow_hash: $6$/i4bVTW/Ec/0o.eU$6BFOeP9xn4ZqoDzYbBjuZZupe9VpiqAn9wrXHmZ.Rcyyl.AxQSzmja75OOupHAuzUQ1WaI8MIUqLKdavT252t/
  ssh-key: "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCt0tiRjcZE0XU38M6aBxaRd+fsVUnezP8/fStL4vgxuzu84FFTTK6XIVsMWVbDxROVA0H98Sexfm8ZDgxYxQ5g6Nk1wbWxB9wbPHrykJScV3tUGWh4RgDq4rm/8+k7Msd32Ubj2t5962W1h1s+zhax4io0bSvJtKj9Ke1Bcq86zEcFg3iPIyxSY8oKK02luT3m73/0L+2BzVXzSsUwE23Nh37+jVAWt+aXHJPlg7YNJGN/9luqx90I3BiPDRtsj4AZ8OtYvBgpDbGK8cVMEDGvOg0+a8JE8oebzgzArYNlWskCRvdvVPU3rk8GKjSKDC2j5lS5sjiiT342Qc3VjoScg5pNfT5fBQ2KI3+LQW9UGwtQwX4er+rjVMM72f1BRI5pIBuwWQIQJzwlqF/KUZcCC5Ly8H8S96thxpmsnXz89GOfldFbQQa/IP5dqCERFpncg1stnDjtGUGU7LIBcBkxqC0ixHgsNKlqFkpT/2oP3y8J875sxykKcigq2Gzw8avXH0W/7nA2pGIrqMMNvMs4ORmSi0FufwNnMsmJYW/baOPUgO+yzUeWLL9wajFo7ozoXz/Q1mLhd8791XO8+ph1Txs0smqV0BK7l/8rAOfhJB1t/aLIB1D+AuVxkfFXUDCcTT4LHSpS7LgI+vsegcN5rPPdFucgMoxLncEC5Fwfww== Jeff Coffler Common Key"
 -
  username: "root"
  shadow_hash: $6$/i4bVTW/Ec/0o.eU$6BFOeP9xn4ZqoDzYbBjuZZupe9VpiqAn9wrXHmZ.Rcyyl.AxQSzmja75OOupHAuzUQ1WaI8MIUqLKdavT252t/
  ssh-key: "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCt0tiRjcZE0XU38M6aBxaRd+fsVUnezP8/fStL4vgxuzu84FFTTK6XIVsMWVbDxROVA0H98Sexfm8ZDgxYxQ5g6Nk1wbWxB9wbPHrykJScV3tUGWh4RgDq4rm/8+k7Msd32Ubj2t5962W1h1s+zhax4io0bSvJtKj9Ke1Bcq86zEcFg3iPIyxSY8oKK02luT3m73/0L+2BzVXzSsUwE23Nh37+jVAWt+aXHJPlg7YNJGN/9luqx90I3BiPDRtsj4AZ8OtYvBgpDbGK8cVMEDGvOg0+a8JE8oebzgzArYNlWskCRvdvVPU3rk8GKjSKDC2j5lS5sjiiT342Qc3VjoScg5pNfT5fBQ2KI3+LQW9UGwtQwX4er+rjVMM72f1BRI5pIBuwWQIQJzwlqF/KUZcCC5Ly8H8S96thxpmsnXz89GOfldFbQQa/IP5dqCERFpncg1stnDjtGUGU7LIBcBkxqC0ixHgsNKlqFkpT/2oP3y8J875sxykKcigq2Gzw8avXH0W/7nA2pGIrqMMNvMs4ORmSi0FufwNnMsmJYW/baOPUgO+yzUeWLL9wajFo7ozoXz/Q1mLhd8791XO8+ph1Txs0smqV0BK7l/8rAOfhJB1t/aLIB1D+AuVxkfFXUDCcTT4LHSpS7LgI+vsegcN5rPPdFucgMoxLncEC5Fwfww== Jeff Coffler Common Key"

  1. I can't seem to log into user root using the SSH key. Examination shows that directory /root/.ssh does not exist.
  2. I can't seem to log into user hanauser using the SSH key. Examination shows that directory /home/hanauser/.ssh is owned by root.
  3. After fixing the owner of /home/hanauser/.ssh, I still couldn't log in by key. /home/hanauser/.ssh/authorized_keys is also owned by root.
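
For reference, the ownership and permissions that would make key-based login work (assuming hanauser's primary group is users):

install -d -m 700 -o root -g root /root/.ssh
chown -R hanauser:users /home/hanauser/.ssh
chmod 700 /home/hanauser/.ssh
chmod 600 /home/hanauser/.ssh/authorized_keys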

LI Testing: Missing netapp configuration file

We've always had a NetApp configuration file, as follows:

sles12sp3:~ # cat /etc/sysctl.d/91-hana-netapp.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 65536 16777216 16777216
net.ipv4.tcp_wmem = 65536 16777216 16777216

Our test system doesn't have that. Who's responsible for installation/configuration of that file?

system_setup: setup and start sap tune service

We do a number of saptune operations for optimization for HANA:

saptune daemon start
saptune solution apply HANA
tuned-adm profile sap-hana
systemctl enable tuned
systemctl start tuned

This procedure should be included in the system setup service and must be called after the package installer service.

Build image to support UEFI boot

In discussions of boot performance with Cisco, they commented that their blades boot much more slowly in MS-DOS compatibility mode, and suggested that using UEFI would be much faster. They also stated that they didn't believe Linux supported UEFI.

Can the SUSE O/S image support booting via UEFI? What's involved for that?

Basic optimization for HANA instances

I've been looking into verification tests that we run since this project is rapidly wrapping up for LI. After investigation, there are some issues.

In brief conversations with @rjschwei earlier, he's hesitant to optimize the system in any way since "perfect" optimizations are often load dependent and thus should be performed by the customer. While that sounds good in theory, we seem to be missing some optimizations that I consider pretty basic.

For example:

  • We currently have some settings via saptune (this seems to be a SUSE-specific thing; I never heard of that before). saptune is installed but not running or configured for HANA on our system.

  • We have some settings in /etc/init.d/boot.local, like:

# cat /etc/init.d/boot.local
cpupower frequency-set -g performance
cpupower set -b 0
echo 0 > /sys/kernel/mm/ksm/run

It seems reasonable enough to have the CPU set for performance, for example, rather than power efficiency.

  • We have some boot options in /proc/cmdline, which I imagine had to come from SAP OSS notes:
BOOT_IMAGE=/vmlinuz-4.4.120-94.17-default root=UUID=8c094969-faeb-4bd0-af46-ee2f78bdf0c9 resume=/dev/mapper/3600a0980383044456a2b4b596f346d70-part3 splash=silent quiet showopts 
numa_balancing=disable transparent_hugepage=never intel_idle.max_cstate=1 processor.max_cstate=1

I'm still picking up what I need to know about image verification, so there may be other items.

Despite what @rjschwei had to say, this sort of stuff seems pretty basic. I would very much like to see images come this way (rather than writing a script, executed through YAML, that makes the appropriate changes).

What say ye?

Location and name of config file on LUN

Would it be possible for the config file name and location on the filesystem to be at a fixed, well-known position? This would avoid a filesystem search and lead to faster code.

One suggestion from Jeff was

<lun_mount_point>/suse_firstboot_config.yaml

That would work for me, but I'm also open to any name which works best for you.
