suse-enceladus / azure-li-services
Azure Large Instance Services
License: GNU General Public License v3.0
Due to diagnosis of Issue #70, I learned that some of our services (clearly the user service, perhaps the network service and others as well) abort upon error and don't perform further processing.
This was particularly exposed when creation of one user resulted in an error and other users, independent of that one, were not created either, preventing me from logging into the machine.
I think it would be useful to continue processing on error. BUT: we still want to capture the error that occurred and be able to report on it (via systemctl or some other mechanism). If for no other reason, that prevents the cycle of "try this YAML", fix one thing, "try this new YAML", only to learn that there may be a series of problems with it. "Trying a YAML" takes quite a bit of time due to the boot times of these very large systems (memory tests, at least on VLI systems, can't be disabled, resulting in 20+ minute POST times).
I think it's super important that, when an error occurs, the failure provides enough information to diagnose the problem. That's probably top priority. But if we can retain that and still perform further YAML processing, I think that would be useful.
Knowing this is the case, I can try to organize the YAML to reduce this problem. But that's just a reduction of the problem; it's still a problem. And in some cases (like the network service), I really do want to attempt all four or five NFS mounts, even if an error occurs early on, in case problems exist only with the later mounts.
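The continue-and-collect behavior described above can be sketched as follows (a hypothetical helper, not the project's actual code; names like `run_all` are illustrative):

```python
# Sketch: run each configuration step, collect failures instead of
# aborting, and report them all at the end so one bad user does not
# block the others from being created.
def run_all(steps):
    """steps: list of (name, callable). Returns list of (name, error)."""
    errors = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # keep going; remember what broke
            errors.append((name, str(exc)))
    return errors

def fail_user_a():
    raise ValueError("uid already in use")

results = run_all([
    ("user:usera", fail_user_a),
    ("user:userb", lambda: None),  # still runs despite usera failing
])
print(results)  # [('user:usera', 'uid already in use')]
```

The collected error list could then feed whatever reporting mechanism (systemctl status, a report file) is chosen.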
From PR #27
Regarding credentials, we need another issue to raise an error if no credentials are specified. Could you please open that one? Thanks.
If no credentials tag exists in the YAML, that should be an error, since that would mean that you can't log into the machine in the end.
There is currently an E-Mail discussion on handling of the YAML file upon completion of system configuration. I thought it would be best to continue that here.
In addition, after configuration, some of the azure-li-services run-once services are left around. While they did offer some debugging capabilities, I think such debugging could be offered by log files as well. At first glance, I don't like the idea of run-once services being left around. What if an administrator inadvertently runs one? Would any of them do damage to an already configured system?
Things to consider for cleanup include:
The YAML file, currently copied to /etc after it's found, is left around. Should it be deleted instead? (I think yes, but it's a discussion issue.)
Service scripts (azure-li-services, azure-li-services-network, and more to follow) are left around. Should they be removed as part of configuration cleanup?
Other artifacts of system configuration (other than log files) that may be in place.
I think any artifacts of the system configuration should be carefully considered. I like log files left around for diagnosis of problems, but I propose that all other configuration artifacts be removed during configuration cleanup.
A conversation with Peter Schinagl revealed that the sequence we use to start the saptune service is wrong. The suggestion for fixing the implementation is as follows:
There is no need to set any tuned profile; this is done by saptune itself.
I'll mark this as a bug unless there are objections to changing this as suggested.
In discussions with the OPS team, I learned that we currently mount a number of NFS volumes. So when we mount disks, the implementation must support NFS, and it must support the special options that you'd specify on the command line.
I've asked our operations team for output from /etc/fstab, and will post that here once I get it. That should allow us to fully understand the requirements.
This came up in a lunch conversation with our HPE vendor, and I had it on my list to look into. Then, it turns out, it came up during meetings with Operations on compute setup:
We need to disable EDAC in the O/S for both LI and VLI as the underlying hardware handles this. This is done by:
modprobe -r sb_edac edac_core
Edit /etc/modprobe.d/blacklist.conf and add the lines below:
blacklist sb_edac
blacklist edac_core
Can this capability be added? Would this need YAML extension, or could you just handle it automatically since it's done for both LI and VLI?
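If this were handled automatically, the service would only need to emit the two blacklist lines above and unload the modules once at setup time. A minimal sketch, assuming the module names and file path from this issue (this is not an implemented feature):

```python
# Sketch: build the modprobe blacklist entries for the EDAC modules
# named above. Writing the result to /etc/modprobe.d/blacklist.conf
# and running "modprobe -r sb_edac edac_core" would happen once at
# configuration time.
EDAC_MODULES = ["sb_edac", "edac_core"]

def blacklist_conf(modules):
    """Return modprobe blacklist file content for the given modules."""
    return "".join("blacklist %s\n" % module for module in modules)

print(blacklist_conf(EDAC_MODULES))
```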
Given machine constraints like:
machine_constraints:
cpu: 73
memory: "1.5tb"
I expected an error:
linux:~ # grep processor /proc/cpuinfo | wc -l
72
linux:~ #
No error at the login prompt on the console. Should I be looking for errors elsewhere?
linux:~ # cat /var/lib/azure_li_services/machine_constraints.report.yaml
machine_constraints:
success: true
linux:~ #
On VLI systems with 12+TB of RAM, crash dumps can take quite a while.
Is it possible, within the YAML file, to select if crash dumps are enabled or disabled? Perhaps something like:
crash-dump: enabled | disabled
If not specified in the YAML, then crash dump should remain as-is (enabled).
Our image has enic driver version 2.3.0.31. In the past, we've always operated with 2.3.0.44.
Has Cisco not submitted their latest driver to SUSE for inclusion into the driver database, or is something else going on? Thanks.
The YAML file, when I started, was adopted from others. As it's being developed, I note that there are fields in there that serve no purpose.
version: is a date field. We have no versioning, but the schema should allow for that. Do we want a date field, or a number here (like version: 1)? We should go on ignoring it for now, but let the field remain in case we need it at a future time.
blade: is an entry (with fields under it). I'm not sure that this serves a purpose at all, but removal will result in code changes for both of us. Remove?
sku: is a placeholder field for us. azure-li-services will ignore it, but we'll continue to specify it.
cpu: is ignored right now. It could serve as a minimum (much like min_size under storage), but seems redundant given that sku covers that. Remove?
memory: is ignored right now. Same as cpu; remove?
time_server: Our colos don't have a time server. If one were set up, that would be customer specific and wouldn't be something generally offered (so the customer would need to set it up). Remove?
Since all of this predates me, I'm going to reach out to my team and get feedback on these issues. I don't want to remove something that might be important for some reason I'm unaware of.
Would it be possible for the filesystem which contains the config file to be created with a persistent label name we can rely on? That would allow us to do a simple label-based mount, and we wouldn't have to search for the device. For example:
mount --label azconfig /mount_point
Depending on which filesystem you choose, the volume label has a size limitation, usually 16 bytes.
I'm open to any name which works for you.
Thanks for your feedback.
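For reference, a label-based lookup reduces to a fixed path under /dev/disk/by-label, which udev maintains for labeled filesystems (a sketch; the label azconfig is just the example from this issue):

```python
import os

def device_for_label(label):
    """udev exposes labeled filesystems under /dev/disk/by-label/."""
    return os.path.join("/dev/disk/by-label", label)

# mount this path on the target directory instead of scanning devices
print(device_for_label("azconfig"))  # /dev/disk/by-label/azconfig
```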
A bad YAML specification (without a directory: tag) will cause the package service to not run properly, but also not trigger an error.
If a packages: tag is found, then any problems should result in an error state.
The YAML file currently has this segment for credentials:
credentials:
username: "hanauser"
password: "somerandompassword" (should this be base-64 encoded to allow for special characters?)
ssh-key: "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCt0tiRjcZE0XU38M6aBxaRd+fsVUnezP8/fStL4vgxuzu84FFTTK6XIVsMWVbDxROVA0H98Sexfm8ZDgxYxQ5g6Nk1wbWxB9wbPHrykJScV3tUGWh4RgDq4rm/8+k7Msd32Ubj2t5962W1h1s+zhax4io0bSvJtKj9Ke1Bcq86zEcFg3iPIyxSY8oKK02luT3m73/0L+2BzVXzSsUwE23Nh37+jVAWt+aXHJPlg7YNJGN/9luqx90I3BiPDRtsj4AZ8OtYvBgpDbGK8cVMEDGvOg0+a8JE8oebzgzArYNlWskCRvdvVPU3rk8GKjSKDC2j5lS5sjiiT342Qc3VjoScg5pNfT5fBQ2KI3+LQW9UGwtQwX4er+rjVMM72f1BRI5pIBuwWQIQJzwlqF/KUZcCC5Ly8H8S96thxpmsnXz89GOfldFbQQa/IP5dqCERFpncg1stnDjtGUGU7LIBcBkxqC0ixHgsNKlqFkpT/2oP3y8J875sxykKcigq2Gzw8avXH0W/7nA2pGIrqMMNvMs4ORmSi0FufwNnMsmJYW/baOPUgO+yzUeWLL9wajFo7ozoXz/Q1mLhd8791XO8+ph1Txs0smqV0BK7l/8rAOfhJB1t/aLIB1D+AuVxkfFXUDCcTT4LHSpS7LgI+vsegcN5rPPdFucgMoxLncEC5Fwfww== Jeff Coffler Common Key"
Anand wasn't comfortable with the raw password being in this file. His concern is that if the file is forwarded around in E-Mail, we're forwarding around credentials, which would be unfortunate.
I propose a new field in the credentials section: password_type. It would have three possible values: plaintext, base64, and external. With external, the password field would contain the name of a file (relative to /etc, since that's where the YAML file is).
So, for an example with plaintext:
credentials:
username: "hanauser"
password: "somerandompassword"
password_type: plaintext
An example with base64:
credentials:
username: "hanauser"
password: "c29tZXJhbmRvbXBhc3N3b3Jk"
password_type: base64
Finally, an example with external:
credentials:
username: "hanauser"
password: "password_file"
password_type: external
In this case, since password_file isn't an absolute path, we'd look for /etc/password_file, and we would expect to find something like:
password: "c29tZXJhbmRvbXBhc3N3b3Jk"
It would be assumed that any password of password_type external would always be stored in Base64 format.
Is it possible to do this?
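A sketch of how the proposed field could be resolved (this is the proposal as described above, not existing behavior; the external file format follows the example):

```python
# Sketch of the proposed password_type handling. "external" files are
# assumed to hold a line of the form: password: "<base64 value>".
import base64
import os

def resolve_password(credentials, etc_dir="/etc"):
    ptype = credentials.get("password_type", "plaintext")
    value = credentials["password"]
    if ptype == "plaintext":
        return value
    if ptype == "base64":
        return base64.b64decode(value).decode("utf-8")
    if ptype == "external":
        # relative names resolve against /etc, absolute paths as-is
        path = value if os.path.isabs(value) else os.path.join(etc_dir, value)
        with open(path) as handle:
            encoded = handle.read().split(":", 1)[1].strip().strip('"')
        return base64.b64decode(encoded).decode("utf-8")
    raise ValueError("unknown password_type: %s" % ptype)

print(resolve_password({"password": "c29tZXJhbmRvbXBhc3N3b3Jk",
                        "password_type": "base64"}))  # somerandompassword
```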
The user service does not allow me to update the root account since it already exists.
It should allow for this (password changes, adding of SSH key).
This is as per discussion in private E-Mail.
Related to #29
When we are "done" with the YAML and all sections are implemented, we should define a schema and add a validation service for the received configuration.
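Until a full schema exists, even a minimal check would catch cases like a missing credentials section (a sketch only; a real implementation would use a proper schema library):

```python
# Minimal validation sketch: report problems rather than aborting.
# Only the credentials rule from the discussion is shown here.
def validate(config):
    problems = []
    if "credentials" not in config:
        problems.append("missing required section: credentials")
    elif not config["credentials"]:
        problems.append("credentials section is empty")
    return problems

print(validate({"version": "2018-01-01"}))
# ['missing required section: credentials']
```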
In certain cases, our VLI systems may not reboot automatically without operator intervention. This is a problem if a customer tries to reboot a system and would then need to contact operations to get the system back up.
In chatting with HPE about this, one of their engineers said:
Autorebooting - startup.nsh is just one option. HPE has no opinion on a preferred method of autobooting. So your choice.
startup.nsh has a decided advantage of surviving both BIOS updates and 'power -c reset'. The latter being a useful tool for clearing up some 'logical' problems, and also required after a BIOS update. So, in my opinion, startup.nsh is preferred over setting a BIOS boot option through the BIOS menu.
So while HPE itself has no opinion on setting up autoboot, their engineers like it, and so do I. We do this as follows:
echo "fs0:\efi\sles_sap\grubx64.efi" > /boot/efi/startup.nsh
It's weird that the file has \ characters in it, but I imagine you folks understand this better than I do. We'd like this to be set up automatically for VLI platforms.
NOTE: This is specific to VLI - Very Large Instance - platforms!
In case the crash kernel memory range was adapted such that the current memory range is different from the new configuration, a reboot of the system (reload of the kernel) is required.
The calculation of the crash memory range is based on the value of kdump calibrate.
If the values differ from the current crash setup, we kexec-reboot the system as the last action of the cleanup service.
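The decision described above boils down to comparing the active crashkernel value with the newly calibrated one (a sketch; the range strings are hypothetical examples):

```python
# Sketch: kexec-reboot only when the crashkernel setting actually
# changed between the running kernel and the new configuration.
def needs_kexec_reboot(current_range, calibrated_range):
    return current_range.strip() != calibrated_range.strip()

print(needs_kexec_reboot("512M-:768M", "512M-:1024M"))  # True
print(needs_kexec_reboot("512M-:768M", "512M-:768M"))   # False
```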
We do a number of saptune operations for optimization for HANA:
saptune daemon start
saptune solution apply HANA
tuned-adm profile sap-hana
systemctl enable tuned
systemctl start tuned
This procedure should be included in the system setup service and must be called after the package installer service.
We generally configure the SUSE firewall as follows:
# cat /etc/sysconfig/SuSEfirewall2 | grep 112
FW_SERVICES_EXT_TCP="22 53 3128 1128"
FW_SERVICES_EXT_UDP="53 1129"
I can get you more data (like the entire file) if needed, just let me know.
What's recommended practice these days for SUSE firewall configuration?
I'm looking at our VLI systems and how we'll be booting for legacy (Gen3) systems.
I am NOT able to boot an ISO file as a USB image. For virtual devices, I have three choices:
If I try to select an .iso file for USB Key, I get an error: "Invalid image file. Cannot redirect Hard disk/USB Key Image."
That error does NOT occur if I mount the .ISO file as an ISO image.
I'm unfamiliar with these imaging styles. How is a .IMG file different from a .ISO file? Can I convert from one to the other? Or can you provide a .IMG file that I can try to work with?
Thanks in advance.
On the latest test image (v1.0.34), I encountered:
azurehost:~ # systemctl status -l azure-li-machine-constraints
* azure-li-machine-constraints.service - Validation of Azure Li/VLi machine constraints
Loaded: loaded (/usr/lib/systemd/system/azure-li-machine-constraints.service; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2018-07-19 21:31:08 UTC; 57min ago
Main PID: 37535 (code=exited, status=1/FAILURE)
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: load_entry_point('azure-li-services==1.1.4', 'console_scripts', 'azure-li-machine-constraints')()
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/machine_constraints.py", line 45, in main
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: machine_constraints
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/machine_constraints.py", line 75, in check_main_memory_validates_constraint
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: min_bytes, binary=True
Jul 19 21:31:08 linux azure-li-machine-constraints[37535]: azure_li_services.exceptions.AzureHostedMachineConstraintException: Main memory: 1.48 TiB is below required minimum: 1.5 TiB
Jul 19 21:31:08 linux systemd[1]: azure-li-machine-constraints.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 21:31:08 linux systemd[1]: Failed to start Validation of Azure Li/VLi machine constraints.
Jul 19 21:31:08 linux systemd[1]: azure-li-machine-constraints.service: Unit entered failed state.
Jul 19 21:31:08 linux systemd[1]: azure-li-machine-constraints.service: Failed with result 'exit-code'.
Warning: azure-li-machine-constraints.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #
The important part: Main memory: 1.48 TiB is below required minimum: 1.5 TiB.
Very, very close. The BIOS states that I have 1.5 TB but, obviously, the O/S sees it ever so slightly differently.
We can compensate for this by reducing the memory size we specify by 0.05 TB. Or maybe machine constraints should have a "noise cancellation" for small differences?
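The "noise cancellation" idea could look like this (a sketch only; the 2% tolerance is an assumption for illustration, not agreed policy):

```python
# Sketch: treat a measured value within a small tolerance of the
# constraint as passing, so BIOS/OS accounting differences don't fail
# the check. The tolerance value is illustrative.
TIB = 1024 ** 4

def memory_ok(measured_bytes, required_bytes, tolerance=0.02):
    return measured_bytes >= required_bytes * (1 - tolerance)

# 1.48 TiB measured vs. a 1.5 TiB minimum, as in the report above:
print(memory_ok(int(1.48 * TIB), int(1.5 * TIB)))  # True
print(memory_ok(int(1.40 * TIB), int(1.5 * TIB)))  # False
```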
We're still having problems mounting LUNs reliably from our Linux blade (will follow up with that offline after additional testing).
However, I've since learned that Azure data centers often use DVD drives to get data onto these systems. As we're moving more towards automation of infrastructure, it's becoming easier for us to make a DVD visible to a blade upon boot.
Would it be possible to look for a LUN (as existing code does) for the YAML and, if the LUN doesn't exist, then look at the local DVD drive on the system? That gives us an option and might be easier for us in the future.
Please let me know, thanks.
It seems like the azure-li-user service is a bear to work out; it's been problematic in the past three builds.
In the latest build, the service still fails with the following error:
azurehost:~ # systemctl status -l azure-li-user
* azure-li-user.service - Setup of Azure Li/VLi workload user
Loaded: loaded (/usr/lib/systemd/system/azure-li-user.service; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2018-08-22 18:02:10 UTC; 1min 56s ago
Main PID: 37598 (code=exited, status=1/FAILURE)
Aug 22 18:02:10 linux azure-li-user[37598]: Traceback (most recent call last):
Aug 22 18:02:10 linux azure-li-user[37598]: File "/usr/bin/azure-li-user", line 11, in <module>
Aug 22 18:02:10 linux azure-li-user[37598]: load_entry_point('azure-li-services==1.1.7', 'console_scripts', 'azure-li-user')()
Aug 22 18:02:10 linux azure-li-user[37598]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 74, in main
Aug 22 18:02:10 linux azure-li-user[37598]: raise AzureHostedException(user_setup_errors)
Aug 22 18:02:10 linux azure-li-user[37598]: azure_li_services.exceptions.AzureHostedException: [AzureHostedCommandException('umount: stderr: umount: /mnt: target is busy\n (In some cases useful info about processes that\n use the device is found by lsof(8) or fuser(1).)\n, stdout: (no output on stdout)',)]
Aug 22 18:02:10 linux systemd[1]: azure-li-user.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 18:02:10 linux systemd[1]: Failed to start Setup of Azure Li/VLi workload user.
Aug 22 18:02:10 linux systemd[1]: azure-li-user.service: Unit entered failed state.
Aug 22 18:02:10 linux systemd[1]: azure-li-user.service: Failed with result 'exit-code'.
Warning: azure-li-user.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #
Seems like it's trying to unmount the YAML LUN. That's going to be problematic; I guess to unmount you'd need to make sure all other services are done and that there are no timing/race conditions. Perhaps it's safest to handle this in the cleanup service, which runs after everything else is completed?
I'm still trying to get a viable build for PM testing. Once this is fixed, please generate a new .xz file for me to retest. Thanks.
In the SUSE Linux Enterprise Server 12.x for SAP Applications Configuration Guide for SAP HANA, in Section 8.1, there is a package list of packages that should be installed. We are missing some of these.
While this list is not exhaustive, we are missing at least:
I can go through for a more exhaustive list, if needed. But based on our docs, we need those packages for support or proper operations.
Can you please review section 8.1 and let me know your thoughts, thanks.
I tested azure-li-storage, and it's failing. I believe there's a dependency problem, and the network needs to be up in order to mount NFS shares. Here's the relevant output:
linux:~ # systemctl -l status azure-li-storage
* azure-li-storage.service - Setup of Azure Li/VLi Storage Mountpoints
Loaded: loaded (/usr/lib/systemd/system/azure-li-storage.service; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2018-06-27 18:04:46 UTC; 4min 8s ago
Main PID: 26775 (code=exited, status=1/FAILURE)
Jun 27 18:04:46 linux azure-li-storage[26775]: azure_li_services.exceptions.AzureHostedCommandException: mount: stderr: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: mount.nfs: Network is unreachable
Jun 27 18:04:46 linux azure-li-storage[26775]: , stdout: (no output on stdout)
Jun 27 18:04:46 linux systemd[1]: azure-li-storage.service: Main process exited, code=exited, status=1/FAILURE
Jun 27 18:04:46 linux systemd[1]: Failed to start Setup of Azure Li/VLi Storage Mountpoints.
Jun 27 18:04:46 linux systemd[1]: azure-li-storage.service: Unit entered failed state.
Jun 27 18:04:46 linux systemd[1]: azure-li-storage.service: Failed with result 'exit-code'.
Warning: azure-li-storage.service changed on disk. Run 'systemctl daemon-reload' to reload units.
linux:~ #
Can you fix this issue please?
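For what it's worth, the conventional systemd way to express this ordering is a unit dependency on the network being online; a sketch of a drop-in, assuming the unit names shown in the log above (this is not the project's actual unit file):

```ini
# hypothetical drop-in: /etc/systemd/system/azure-li-storage.service.d/network.conf
[Unit]
Wants=network-online.target
After=network-online.target azure-li-network.service
```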
Once processed, the ssh-private-key should be deleted from the YAML file. IMHO this can be done as part of the cleanup service.
Multiple bugs here, but all related to user creation:
Given a YAML file containing this segment:
credentials:
-
username: "hanauser"
shadow_hash: $6$/i4bVTW/Ec/0o.eU$6BFOeP9xn4ZqoDzYbBjuZZupe9VpiqAn9wrXHmZ.Rcyyl.AxQSzmja75OOupHAuzUQ1WaI8MIUqLKdavT252t/
ssh-key: "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCt0tiRjcZE0XU38M6aBxaRd+fsVUnezP8/fStL4vgxuzu84FFTTK6XIVsMWVbDxROVA0H98Sexfm8ZDgxYxQ5g6Nk1wbWxB9wbPHrykJScV3tUGWh4RgDq4rm/8+k7Msd32Ubj2t5962W1h1s+zhax4io0bSvJtKj9Ke1Bcq86zEcFg3iPIyxSY8oKK02luT3m73/0L+2BzVXzSsUwE23Nh37+jVAWt+aXHJPlg7YNJGN/9luqx90I3BiPDRtsj4AZ8OtYvBgpDbGK8cVMEDGvOg0+a8JE8oebzgzArYNlWskCRvdvVPU3rk8GKjSKDC2j5lS5sjiiT342Qc3VjoScg5pNfT5fBQ2KI3+LQW9UGwtQwX4er+rjVMM72f1BRI5pIBuwWQIQJzwlqF/KUZcCC5Ly8H8S96thxpmsnXz89GOfldFbQQa/IP5dqCERFpncg1stnDjtGUGU7LIBcBkxqC0ixHgsNKlqFkpT/2oP3y8J875sxykKcigq2Gzw8avXH0W/7nA2pGIrqMMNvMs4ORmSi0FufwNnMsmJYW/baOPUgO+yzUeWLL9wajFo7ozoXz/Q1mLhd8791XO8+ph1Txs0smqV0BK7l/8rAOfhJB1t/aLIB1D+AuVxkfFXUDCcTT4LHSpS7LgI+vsegcN5rPPdFucgMoxLncEC5Fwfww== Jeff Coffler Common Key"
-
username: "root"
shadow_hash: $6$/i4bVTW/Ec/0o.eU$6BFOeP9xn4ZqoDzYbBjuZZupe9VpiqAn9wrXHmZ.Rcyyl.AxQSzmja75OOupHAuzUQ1WaI8MIUqLKdavT252t/
ssh-key: "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCt0tiRjcZE0XU38M6aBxaRd+fsVUnezP8/fStL4vgxuzu84FFTTK6XIVsMWVbDxROVA0H98Sexfm8ZDgxYxQ5g6Nk1wbWxB9wbPHrykJScV3tUGWh4RgDq4rm/8+k7Msd32Ubj2t5962W1h1s+zhax4io0bSvJtKj9Ke1Bcq86zEcFg3iPIyxSY8oKK02luT3m73/0L+2BzVXzSsUwE23Nh37+jVAWt+aXHJPlg7YNJGN/9luqx90I3BiPDRtsj4AZ8OtYvBgpDbGK8cVMEDGvOg0+a8JE8oebzgzArYNlWskCRvdvVPU3rk8GKjSKDC2j5lS5sjiiT342Qc3VjoScg5pNfT5fBQ2KI3+LQW9UGwtQwX4er+rjVMM72f1BRI5pIBuwWQIQJzwlqF/KUZcCC5Ly8H8S96thxpmsnXz89GOfldFbQQa/IP5dqCERFpncg1stnDjtGUGU7LIBcBkxqC0ixHgsNKlqFkpT/2oP3y8J875sxykKcigq2Gzw8avXH0W/7nA2pGIrqMMNvMs4ORmSi0FufwNnMsmJYW/baOPUgO+yzUeWLL9wajFo7ozoXz/Q1mLhd8791XO8+ph1Txs0smqV0BK7l/8rAOfhJB1t/aLIB1D+AuVxkfFXUDCcTT4LHSpS7LgI+vsegcN5rPPdFucgMoxLncEC5Fwfww== Jeff Coffler Common Key"
Cannot log in as root using the SSH key. Examination shows that directory /root/.ssh does not exist.
Cannot log in as hanauser using the SSH key. Examination shows that directory /home/hanauser/.ssh is owned by root.
After fixing ownership of /home/hanauser/.ssh, I still couldn't log in by key. /home/hanauser/.ssh/authorized_keys is also owned by root.
For SAP HANA instances, the recommended energy performance settings are:
# CPU Frequency/Voltage scaling
cpupower frequency-set -g performance
# low latency/maximum performance
cpupower set -b 0
Add this to /etc/init.d/boot.local for persistence. The image has the rc-local service activated, which causes that file to be read.
Currently the image allows root login on the console via an insecure password.
This issue exists so we don't forget to remove the root password once the image is handed over to customers. Also delete the pre-installed authorized_keys setup from the overlay files archive.
Would it be possible for the config file name and its location on the filesystem to be fixed? This would avoid a filesystem search and lead to faster code.
One suggestion from Jeff was
<lun_mount_point>/suse_firstboot_config.yaml
That would work for me, but I'm also open to any name which works best for you.
Note the following output:
soltyo32:~ # rpm -qa | grep -i kmod
kmod-compat-17-9.6.1.x86_64
libkmod2-17-9.6.1.x86_64
* kmod-usnic_verbs-1.1.545.8.sles12sp3-1.x86_64
kmod-17-9.6.1.x86_64
The line marked with "*" is missing on the test image, and we normally have this (and check for it). What is this, and should we have it?
I've been looking into verification tests that we run since this project is rapidly wrapping up for LI. After investigation, there are some issues.
In brief conversations with @rjschwei earlier, he's hesitant to optimize the system in any way since "perfect" optimizations are often load dependent and thus should be performed by the customer. While that sounds good in theory, we seem to be missing some optimizations that I consider pretty basic.
For example:
We currently have some settings via saptune (this seems to be a SUSE-specific thing; I'd never heard of it before). saptune is installed but not running or configured for HANA on our system.
We have some settings in /etc/init.d/boot.local, like:
# cat /etc/init.d/boot.local
cpupower frequency-set -g performance
cpupower set -b 0
echo 0 > /sys/kernel/mm/ksm/run
Seems reasonable enough to have the CPU set for performance, for example, rather than power efficiency.
We have some kernel parameters in /proc/cmdline, which I imagine had to come from SAP OSS notes:
BOOT_IMAGE=/vmlinuz-4.4.120-94.17-default root=UUID=8c094969-faeb-4bd0-af46-ee2f78bdf0c9 resume=/dev/mapper/3600a0980383044456a2b4b596f346d70-part3 splash=silent quiet showopts numa_balancing=disable transparent_hugepage=never intel_idle.max_cstate=1 processor.max_cstate=1
I'm still picking up what I need to know about image verification, so there may be other items.
Despite what @rjschwei had to say, this sort of stuff seems pretty basic. I would very much like to see images come this way (rather than writing a script, executed through YAML, that makes the appropriate changes).
What say ye?
We need a runtime check, if storage is configured then we also need to handle storage authentication. This currently implies we need to have the 'ssh-private-key' entry in the YAML. This is related to discussion #42
If I create a private key for an alternate user (say hanatest) by adding a line like:
ssh-private-key: "ssh/id_dsa_netapp"
to a user in the credentials section, then I get an error upon execution of that service at boot-up:
azurehost:~ # systemctl status -l azure-li-user
* azure-li-user.service - Setup of Azure Li/VLi workload user
Loaded: loaded (/usr/lib/systemd/system/azure-li-user.service; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2018-08-15 15:51:38 UTC; 3min 33s ago
Main PID: 37503 (code=exited, status=1/FAILURE)
Aug 15 15:51:38 linux azure-li-user[37503]: load_entry_point('azure-li-services==1.1.5', 'console_scripts', 'azure-li-user')()
Aug 15 15:51:38 linux azure-li-user[37503]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 52, in main
Aug 15 15:51:38 linux azure-li-user[37503]: setup_ssh_authorization(user)
Aug 15 15:51:38 linux azure-li-user[37503]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 143, in setup_ssh_authorization
Aug 15 15:51:38 linux azure-li-user[37503]: Command.run(['umount', ssh_key_source.location])
Aug 15 15:51:38 linux azure-li-user[37503]: UnboundLocalError: local variable 'ssh_key_source' referenced before assignment
Aug 15 15:51:38 linux systemd[1]: azure-li-user.service: Main process exited, code=exited, status=1/FAILURE
Aug 15 15:51:38 linux systemd[1]: Failed to start Setup of Azure Li/VLi workload user.
Aug 15 15:51:38 linux systemd[1]: azure-li-user.service: Unit entered failed state.
Aug 15 15:51:38 linux systemd[1]: azure-li-user.service: Failed with result 'exit-code'.
Warning: azure-li-user.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #
If you can take a look at this, that would be great, thanks.
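The UnboundLocalError in the traceback is the classic pattern of a name bound only on one branch but referenced unconditionally during cleanup. A distilled sketch (illustrative names, not the project's actual user.py code):

```python
# Broken pattern: ssh_key_source is only assigned inside a conditional,
# but the cleanup path references it unconditionally.
def setup_broken(needs_mount):
    if needs_mount:
        ssh_key_source = "/mnt"       # only bound on this branch
    return ssh_key_source             # UnboundLocalError when False

# One way out: bind the name up front and clean up only what was set up.
def setup_fixed(needs_mount):
    ssh_key_source = None             # bind before the conditional
    if needs_mount:
        ssh_key_source = "/mnt"
    if ssh_key_source is not None:
        pass                          # e.g. umount(ssh_key_source)
    return ssh_key_source

try:
    setup_broken(False)
except UnboundLocalError as exc:
    print("reproduced:", exc)
print(setup_fixed(False))             # None
print(setup_fixed(True))              # /mnt
```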
Add support for handling the machine_constraints section
machine_constraints:
min_cores: 32
min_memory: "20tb"
There is the packages section in the yaml config as follows:
packages:
directory: "/mnt/directory-with-rpm-files"
directory: "/mnt/another-directory-with-rpm-files"
I'd like to start a discussion on how those packages should be installed.
Basically there are two options:
1. Take all packages from the listed directories and install them the brute-force way by calling rpm.
2. Create a clean repository from the packages using the createrepo tool, register that repo with the zypper package manager, and install all packages in a transaction with the zypper tool.
I vote to do it "right" with option 2. That's because of the following reasons:
A package install transaction through zypper checks for consistency and dependencies and allows all package dependencies to be resolved in a clean way. Of course, there is also the potential that this fails, resulting in a package not being installable.
A package installation via zypper takes all other configured repositories into account and helps to keep the system integrity intact
A package installation via zypper provides good error messages in terms of problems with dependencies, conflicts and others
The brute-force way ignores all of that and blindly installs data onto the machine, which I personally would not do. If there are no objections, I'll implement option 2.
Thoughts?
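Option 2 could be sketched as a command sequence built per configured directory (commands are only constructed here, not executed; the repository alias naming is made up for illustration):

```python
# Sketch of option 2: createrepo each directory, register it with
# zypper, then install everything in one zypper transaction.
def install_commands(directories, packages):
    commands = []
    for index, directory in enumerate(directories):
        alias = "azure-li-packages-%d" % index      # illustrative alias
        commands.append(["createrepo", directory])  # build repo metadata
        commands.append(["zypper", "addrepo", "file://" + directory, alias])
    commands.append(["zypper", "--non-interactive", "install"] + list(packages))
    return commands

for command in install_commands(["/mnt/directory-with-rpm-files"], ["pkg"]):
    print(" ".join(command))
```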
In discussion of boot performance with Cisco, they commented that their blades boot much more slowly in MS-DOS compatibility mode, and suggested that use of UEFI would be much faster. They also stipulated that they didn't believe Linux supported UEFI.
Can the SUSE O/S image support booting via UEFI? What's involved for that?
In the past, we've always had both the Cisco enic and Cisco fnic drivers installed on LI systems. For some reason, this image doesn't appear to have that:
soltyo32:~ # rpm -qa | grep -i cisco
cisco-enic-usnic-kmp-default-3.0.44.553.545.8_k4.4.73_5-1.x86_64
* cisco-fnic-kmp-default-1.6.0.36_k4.4.73_5-1.x86_64
(Line marked with "*" does not exist on the test image)
Only enic is showing up.
Note: When we do "modinfo fnic", the driver does show up (but version 1.6.0.34 instead of 1.6.0.36).
In the past, we've had RPMs for both enic and fnic.
This may be a Cisco issue (I'm not sure). If so, let me know and I'll involve them.
This isn't necessarily a problem, but a discussion topic.
We've noted that our SP3 image does not create a separate /boot partition. Now, in yesteryear, it was common to have a separate /boot partition, partly for "safety", partly to ensure there was always a bit of free space there "just in case", etc. In the past, we've worked with systems that create this.
What's common practice for a separate /boot partition these days? Does SAP/HANA give specific guidance in this area?
Back from (yet another) vacation, testing image 1.0.35. Sorry for delay.
Upon booting the blade with image 1.0.35, no errors were shown on the console. Yet, when I took a look at the services, I observed this:
azurehost:~ # systemctl -a | grep azure-li
azure-li-call.service loaded inactive dead Setup of Azure Li/VLi Script Caller
* azure-li-cleanup.service loaded failed failed Cleanup/Uninstall Azure Li/VLi services
azure-li-config-lookup.service loaded inactive dead Lookup and import Azure Li/VLi config file
azure-li-install.service loaded inactive dead Installation of custom Azure Li/VLi addon packages
azure-li-machine-constraints.service loaded inactive dead Validation of Azure Li/VLi machine constraints
azure-li-network.service loaded inactive dead Setup of Azure Li/VLi network configuration
azure-li-report.service loaded inactive dead Report status of Azure Li/VLi services
azure-li-storage.service loaded inactive dead Setup of Azure Li/VLi Storage Mountpoints
azure-li-system-setup.service loaded inactive dead System Setup of Azure Li/VLi machine
azure-li-user.service loaded inactive dead Setup of Azure Li/VLi workload user
azurehost:~ # systemctl -l status azure-li-cleanup
* azure-li-cleanup.service - Cleanup/Uninstall Azure Li/VLi services
Loaded: loaded (/usr/lib/systemd/system/azure-li-cleanup.service; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2018-08-14 17:20:37 UTC; 6min ago
Process: 53465 ExecStart=/usr/bin/azure-li-cleanup (code=exited, status=1/FAILURE)
Main PID: 53465 (code=exited, status=1/FAILURE)
Aug 14 17:20:32 azurehost systemd[1]: Starting Cleanup/Uninstall Azure Li/VLi services...
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: Traceback (most recent call last):
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: File "/usr/bin/azure-li-cleanup", line 11, in <module>
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/cleanup.py", line 38, in main
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: File "/usr/lib/python3.4/site-packages/azure_li_services/defaults.py", line 46, in get_service_reports
Aug 14 17:20:37 azurehost azure-li-cleanup[53465]: ImportError: No module named 'azure_li_services.status_report'
Aug 14 17:20:37 azurehost systemd[1]: azure-li-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Aug 14 17:20:37 azurehost systemd[1]: Failed to start Cleanup/Uninstall Azure Li/VLi services.
Aug 14 17:20:37 azurehost systemd[1]: azure-li-cleanup.service: Unit entered failed state.
Aug 14 17:20:37 azurehost systemd[1]: azure-li-cleanup.service: Failed with result 'exit-code'.
Warning: azure-li-cleanup.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #
It looks like the cleanup service failed because it couldn't import the azure_li_services.status_report module. Thoughts?
Due to problems with azure-li-machine-constraints, my /etc/issue file has:
linux:~ # cat /etc/issue
!!! DEPLOYMENT ERROR !!!
For details see: "systemctl status azure-li-machine-constraints azure-li-storage"
linux:~ #
Looking at azure-li-storage, I see no problems (indeed, my YAML doesn't configure that yet - I'm testing that next):
linux:~ # systemctl status azure-li-storage
* azure-li-storage.service - Setup of Azure Li/VLi Storage Mountpoints
Loaded: loaded (/usr/lib/systemd/system/azure-li-storage.service; bad; vendor preset: disabled)
Active: inactive (dead) since Tue 2018-06-19 15:27:25 UTC; 29min ago
Main PID: 26652 (code=exited, status=0/SUCCESS)
Jun 19 15:27:25 linux systemd[1]: Starting Setup of Azure Li/VLi Storage Mountpoints...
Jun 19 15:27:25 linux systemd[1]: Started Setup of Azure Li/VLi Storage Mountpoints.
Warning: azure-li-storage.service changed on disk. Run 'systemctl daemon-reload' to reload units.
linux:~ #
Why is this mentioned in /etc/issue when it's not actually a problem?
Description
In the scope of the Azure LI services there is the potential that something fails or that the machine does not match the requirements. In this case a procedure to inform the user, plus the ability to still access the system, is desired.
In the scope of SAP HANA, ksmd should not run:
# stop ksmd from running but keep merged pages,
echo 0 > /sys/kernel/mm/ksm/run
Add this to /etc/init.d/boot.local for persistence. The image has the rc-local service activated, which causes that file to be executed at boot.
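The persistence step could be done with something like the following (a sketch; the heredoc simply appends the setting to boot.local, which the rc-local service runs at boot):

```shell
# Append the KSM setting to boot.local so it survives reboots.
# /etc/init.d/boot.local is executed by the rc-local service on this image.
cat >> /etc/init.d/boot.local <<'EOF'
# stop ksmd from running but keep merged pages
echo 0 > /sys/kernel/mm/ksm/run
EOF
```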
It turns out that DHCP servers are not available in our bare-metal data centers, and the latest YAML specification assumed that DHCP was available.
I suggest updating the YAML specification as follows:
REMOVE
nics: 2
ADD
networking:
  - interface: eth0
    description: Client vlan-10
    ip: 10.250.10.51
    gateway: 10.250.10.1
    subnet mask: 255.255.255.0
  - interface: eth1
    description: NFS vlan-51
    ip: 10.250.51.51
    subnet mask: 255.255.255.0
  - interface: eth2
    description: B2B vlan-52
    ip: 10.250.52.51
    subnet mask: 255.255.255.0
In this particular case, you can infer that there are three NICs. Description is for Microsoft purposes only, and can be ignored by azure-li-services. A few things to note: each entry may carry dhcp: true or dhcp: false. If DHCP is false, then it should be followed by IP/subnet mask[/gateway]. This is purely optional. The latest complete YAML specification, with this proposed change, can be found here.
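To make the per-NIC rules concrete, here is a minimal sketch of how a service might validate the proposed networking section after loading the YAML. The list literal stands in for the parsed YAML; the key names ("interface", "ip", "subnet mask", "gateway", "dhcp") follow the proposal above, and the values are illustrative.

```python
# Sketch: validate the proposed "networking" section after YAML load.
import ipaddress

def validate_nic(nic):
    """Return a list of error strings for one NIC entry."""
    errors = []
    if "interface" not in nic:
        errors.append("missing interface name")
    if nic.get("dhcp"):
        return errors  # DHCP entries need no static addressing
    for key in ("ip", "subnet mask"):
        if key not in nic:
            errors.append("missing %s" % key)
    for key in ("ip", "gateway"):
        if key in nic:
            try:
                ipaddress.ip_address(nic[key])
            except ValueError:
                errors.append("invalid %s: %s" % (key, nic[key]))
    return errors

# Stand-in for yaml.safe_load(...) output of the proposed section.
networking = [
    {"interface": "eth0", "ip": "10.250.10.51",
     "gateway": "10.250.10.1", "subnet mask": "255.255.255.0"},
    {"interface": "eth1", "ip": "10.250.51.51",
     "subnet mask": "255.255.255.0"},
]
print({n["interface"]: validate_nic(n) for n in networking})
# {'eth0': [], 'eth1': []}
```

Collecting the errors per NIC (instead of raising on the first one) also matches the "continue processing on error, but report everything" behavior requested elsewhere in these issues.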
We've always had a NetApp configuration file, as follows:
sles12sp3:~ # cat /etc/sysctl.d/91-hana-netapp.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 65536 16777216 16777216
net.ipv4.tcp_wmem = 65536 16777216 16777216
Our test system doesn't have that. Who's responsible for installation/configuration of that file?
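For what it's worth, whether those tunables are active can be checked, and the file applied by hand, with standard sysctl calls (a sketch):

```shell
# Show the current values of two of the NetApp tunables.
sysctl net.core.rmem_max net.core.wmem_max
# Apply the file manually if it exists but hasn't been loaded yet.
sysctl -p /etc/sysctl.d/91-hana-netapp.conf
```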
Early on, we made the decision to allow SUSE to choose the GID when creating groups.
It turns out that this won't work. We need to know both the UID and the GID (the numeric IDs, not the names) in order to set up the storage subsystem, which works from UID and GID rather than names. So having the SUSE system choose the GID won't work, since we wouldn't be able to set up storage. Storage must be set up in advance to be able to deploy the O/S.
I had asked for GID originally and, as our operations team was looking at what we provided, they pointed out this restriction. I was previously unaware of this (sorry).
This is a bit of a show-stopper for us, as we can't do a complete UCS (HLI) deployment, which is the next step in testing. I can download the existing image to do basic testing (make sure prior issues that were resolved were indeed fixed), but in terms of the next main test pass, we'll need this.
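For illustration, pinning both IDs at creation time would look like this (the names sapsys/hanadm and the numeric IDs 2001/1001 are made-up values, not our real ones):

```shell
# Create the group with a fixed GID, then the user with fixed UID/GID,
# so the storage subsystem's numeric UID/GID mapping lines up.
groupadd -g 2001 sapsys
useradd -u 1001 -g 2001 -m hanadm
```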
We had a recent customer issue where the O/S crashed due to a panic. Unfortunately, the machine was not configured for kernel dumps, so we were unable to capture what happened, much less why.
Are SAP Hana images configured for kernel dumps (dump level 31) by default? If not, can we do that?
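As a quick check on a running SUSE system (a sketch, not a definitive procedure):

```shell
# Is the kdump service active, and what dump level is configured?
systemctl is-active kdump
grep KDUMP_DUMPLEVEL /etc/sysconfig/kdump
```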
I had problems storing a private key due to an unbound variable (resolved in issue #70), but with the latest build (1.0.36-Build2.2), a different problem surfaces:
azurehost:~ # systemctl status -l azure-li-user
* azure-li-user.service - Setup of Azure Li/VLi workload user
Loaded: loaded (/usr/lib/systemd/system/azure-li-user.service; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2018-08-20 17:23:21 UTC; 22min ago
Main PID: 37684 (code=exited, status=1/FAILURE)
Aug 20 17:23:21 linux azure-li-user[37684]: Traceback (most recent call last):
Aug 20 17:23:21 linux azure-li-user[37684]: File "/usr/bin/azure-li-user", line 11, in <module>
Aug 20 17:23:21 linux azure-li-user[37684]: load_entry_point('azure-li-services==1.1.6', 'console_scripts', 'azure-li-user')()
Aug 20 17:23:21 linux azure-li-user[37684]: File "/usr/lib/python3.4/site-packages/azure_li_services/units/user.py", line 74, in main
Aug 20 17:23:21 linux azure-li-user[37684]: raise AzureHostedException(user_setup_errors)
Aug 20 17:23:21 linux azure-li-user[37684]: azure_li_services.exceptions.AzureHostedException: [AzureHostedConfigFileSourceMountException('Source mount failed with: primary:mount: /dev/mapper/3600a09803830362f6e2b48516b525761-part1 is already mounted or /mnt busy\n /dev/mapper/3600a09803830362f6e2b48516b525761-part1 is already mounted on /mnt\n, fallbackmount: special device /dev/dvd does not exist\n',)]
Aug 20 17:23:21 linux systemd[1]: azure-li-user.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 17:23:21 linux systemd[1]: Failed to start Setup of Azure Li/VLi workload user.
Aug 20 17:23:21 linux systemd[1]: azure-li-user.service: Unit entered failed state.
Aug 20 17:23:21 linux systemd[1]: azure-li-user.service: Failed with result 'exit-code'.
Warning: azure-li-user.service changed on disk. Run 'systemctl daemon-reload' to reload units.
azurehost:~ #
Looks like the code fails to mount the config source when fetching the private key (although similar code, in script execution for example, seems to work).
Please advise and let me know, thanks.
I've finally been able to meet with the operations team and discuss compute setup. This will generate a number of new issues. Sorry it took so long; it's hard to get meetings with the operations team, and we discussed a long list of topics (storage setup, network setup, compute setup, etc.).
First on the list: We need a way to set the hostname of the system.
It does not appear necessary, based on my information, to set a domain name, or to otherwise configure the /etc/hosts file.
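For reference, once the YAML carries a hostname, the service could apply it with the standard tool (the value azurehost is illustrative):

```shell
# Set both the transient and the static hostname in one call.
hostnamectl set-hostname azurehost
```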
Final issue that came up in meetings with Operations:
We need to configure SSH for access to storage. This is done by SSH public/private key configuration: /root/.ssh/id_dsa. This is currently done with a command like ssh-keygen -t dsa -b xxx.
It would be a pain to get the public key off of the system (after generating the password hashes, we don't save the source password(s), if any). So I propose we generate the key elsewhere and then add the private key to the YAML file to be stored in /root/.ssh/id_dsa. Perhaps, in the credentials section, add:
username: root
...
private_key:
  name: /root/.ssh/id_dsa
  ssh-key: "XXXXX"
Thoughts?
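To sketch what consuming that section might look like on the service side (function name, the /tmp demo path, and the key material "XXXXX" are illustrative, not the project's actual API):

```python
# Sketch: persist a private key taken from the YAML credentials section.
import os

def write_private_key(path, key_material):
    """Write key material with owner-only permissions (0600)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Open with mode 0600 up front so the key is never world-readable,
    # which sshd requires for private key files.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(key_material.strip() + "\n")

write_private_key("/tmp/demo_ssh/id_dsa", "XXXXX")
```

The point of creating the file with restrictive permissions from the start (rather than chmod afterward) is that the key is never exposed, even briefly.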