Error handling on service failure about azure-li-services HOT 15 CLOSED

schaefi commented on September 22, 2024

Error handling on service failure

from azure-li-services.

Comments (15)

schaefi commented on September 22, 2024

Conversation from #14 continued...

From @rjschwei
Maybe we are overthinking the error condition handling. Let's get back to basics here: what we want to prevent is that the user, unaware of the error condition, spends a significant amount of effort installing, configuring, and optimizing SAP.

Preventing this time investment by the user does not imply that anyone from Msft has to be able to log in to the system; it only implies that the user be notified that an error occurred during system configuration.

With this as the framework, if we detect that attached disks, memory, or cpu count are not in compliance with the specification per YAML config we can do the following:

Set up the user per YAML to allow the customer to log in
Set up the network per YAML
Change motd to provide guidance on what the user should do next
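
The compliance check described above can be sketched as a comparison between the values requested in the YAML config and the values found on the machine. This is only an illustration: the function name `find_violations` and the keys `cpu_count` and `memory_gb` are hypothetical and not taken from azure-li-services, and strict equality is a simplifying assumption (a real check might accept a minimum instead).

```python
def find_violations(expected, found):
    """Return (resource, expected, found) tuples for every mismatch.

    Both arguments are plain dicts. The keys are hypothetical names
    chosen for this sketch, not the project's actual YAML schema.
    """
    violations = []
    for resource, expected_value in expected.items():
        found_value = found.get(resource)
        # Assumption: the machine must match the specification exactly
        if found_value != expected_value:
            violations.append((resource, expected_value, found_value))
    return violations


expected = {'cpu_count': 128, 'memory_gb': 2048}
found = {'cpu_count': 64, 'memory_gb': 2048}
print(find_violations(expected, found))
```

An empty result would mean the machine matches the specification; any entry could be turned into a message for the user.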

For example, if something goes wrong in the CPU check, motd, which is displayed upon ssh login, could display the following:

"""
ERROR

System configuration failed, received configuration for setup of 128 CPUs, found 64
Please contact Azure operations team at [email protected]
"""

With the "ERROR" word in sufficiently large ASCII art to make it really difficult to miss. We can reasonably easily produce customized error messages for disk space errors and memory errors.
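
A minimal sketch of how such a message could be assembled, assuming a hypothetical helper `build_motd_error` and a simple banner of asterisks in place of real ASCII art. The contact address is passed in by the caller because it is not spelled out here; writing the result to /etc/motd would additionally need root privileges and is left out.

```python
def build_motd_error(resource, expected, found, contact):
    """Build an eye-catching motd text for a failed validation.

    All parameter names are assumptions made for this sketch.
    """
    width = 60
    banner = '*' * width
    lines = [
        banner,
        'ERROR'.center(width),
        banner,
        '',
        'System configuration failed, received configuration for '
        'setup of {0} {1}, found {2}'.format(expected, resource, found),
        'Please contact Azure operations team at {0}'.format(contact),
        '',
    ]
    return '\n'.join(lines)


# The contact value below is a placeholder, not a real address
print(build_motd_error('CPUs', 128, 64, '<operations contact>'))
```

The same helper would cover disk and memory errors by changing only the `resource` argument.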

The user at this point knows how to get help, which is really what we need. In the end if such a condition occurs because of a mis-deployment the user cannot help themselves anyway.

We'll have a reasonably simple solution that does not leave the user hanging. It is not 100% foolproof: a user can log in via ssh and ignore the message in front of them, but at the very least they had the opportunity to avoid spinning their wheels for nothing. Most people will reasonably accept that having invested effort after an error condition is their own fault.

from azure-li-services.

schaefi commented on September 22, 2024

So you would create a troubleshoot account if an error condition exists, if I got it right?

This means the machine is opened up for access by the Microsoft team on error automatically. Would that not violate the user's expectation that everybody is locked out?
If the error is in the network and/or user service, the concept will probably fail.

I like the solution, I just don't think we can find a 100% foolproof one.

from azure-li-services.

schaefi commented on September 22, 2024

From @rjschwei
I am opposed to creating a "generic account"; IMHO it opens too many holes. For example, a disgruntled actor can leave something behind, and if the system gets fixed rather than redeployed from scratch there is going to be a big mess. See my previous reply; IMHO we need to provide the customer with a way to help themselves or find help. This can be done without creating a generic account.

from azure-li-services.

schaefi commented on September 22, 2024

@jeffaco

Summarizing the details the current idea would be:

Create a meaningful /etc/motd message in case of a service error or validation violation
The user and network services have to run to ensure user and network access within the scope of the YAML config. If errors occur in the user or network services, that is considered bad luck.

Thoughts?

from azure-li-services.

jeffaco commented on September 22, 2024

I've given this some thought. Here's where I am on this:

Operations doesn't have access to the system via VLAN for direct SSH access. However, Operations does have access to a KVM.
Operations will not know the password that the user selected. However, our user setup service does have the capability to create multiple accounts.
I think it's really sloppy to deploy a system to a customer in an error state and not know about it, relying on the customer to tell us about it. We need a mechanism to check that things worked (although I'm at a loss as to how to do this right now, except by hand).

Given this, I suggest the following:

Create a meaningful /etc/motd message in case of a service error or validation violation
We always create a management account as part of the YAML so we can check for errors via management ports (KVM).

This would work, but has two problems, as I see it:

KVM access most likely could not be automated (graphical interface). So, ultimately, this means that final validation would need to be done by hand. Not the end of the world, but not optimal.
It's not clear how to delete the account. For test purposes, I created an account with sudo access, foo. When I went to delete it, I got:

jeffcof:~> su - foo
Password: 
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

foo@jeffcof-ub18:~$ sudo ls
[sudo] password for foo: 
foo@jeffcof-ub18:~$ sudo userdel -r foo
userdel: user foo is currently used by process 2846
foo@jeffcof-ub18:~$

This is less than optimal. Any suggestions to be able to delete the account when it's the only account we have on the system?

Could you rig the account to auto-delete itself once it's logged out of? So perhaps it's a "log in once" account? Or perhaps, in case of no errors, just tell Ops that no errors occurred, then log them out automatically and delete the account (without ops having the opportunity to execute commands - unless errors occurred)?

from azure-li-services.

rjschwei commented on September 22, 2024

@jeffaco

User management only works with root privileges. Can you please explain why you think we need to blow a giant hole into the security model?

If the customer has a meaningful error and it is necessary that Msft OPS gets access to the machine, then the customer should be in charge of granting that access, and not some magic we implement. If the magic we implement ever gets exploited, there is a good chance that customers would find another place to run their workloads. Is it really worth it?

from azure-li-services.

jeffaco commented on September 22, 2024

I don't believe it's a "giant hole in the security model":

Customer data is not on the system at this point, just the O/S. So it's not like PII is at risk here, it's just the O/S.
Today, there is no automatic deployment at all, so it's no different than today, when Ops must get on the system to configure it.
We should be able to be sure that the deployment actually works. I think it's bad form to hand a system over if it's obvious that it won't work. That's just bad customer service (or worse).

That's why I'm looking for some way to know that everything actually worked. I'd love to do it without an account, but we're not coming up with great ideas here.

Is there some way we could get a "deployment is good" or "deployment is not good" indication from the KVM (perhaps changing the login prompt, or issuing a message with the login prompt, if something failed)?

from azure-li-services.

rjschwei commented on September 22, 2024

I don't really want to argue and will just make a couple more points. In the end the liability rests and will rest with Microsoft.

Customer data is not on the system at this point, just the O/S. So it's not like PII is at risk here, it's just the O/S.

It is somewhat immaterial what is on the system at the time when I access it. I can easily leave something behind that leaks any information later and you'll probably never notice or notice after I already have the information I am after.

Today, there is no automatic deployment at all, so it's no different than today, when Ops must get on the system to configure it,

Isn't the point that we want to make things better? Just because Microsoft is exposed to a big liability today does that imply you want to be exposed to the same liability tomorrow?

In Cloud framework implementations (Azure, EC2, GCE, OpenStack), people go to great lengths to ensure data cannot be accessed by the framework operator.

I am unaware of a way to communicate back to the UCS framework, and even if we can communicate back to UCS I am not certain we can do the same on the VLI systems.

from azure-li-services.

schaefi commented on September 22, 2024

Is there some way we could get a "deployment is good" or "deployment is not good" indication from the KVM (perhaps changing the login prompt, or issuing a message with the login prompt, if something failed)?

There are two possible data sources which get displayed when the system reaches the point where it allows a user login:

/etc/issue
/etc/motd

The contents of /etc/issue are displayed once the getty process starts (that's the terminal process). You will see this message on any terminal. It normally contains something like

Welcome to openSUSE Leap 42.3 - Kernel \r (\l).

We can change that and use it as an indicator because it is displayed in any case even if you are not logged in.

Once you log in, the contents of /etc/motd are displayed.

"Abusing" those files doesn't sound like a good solution to me, but if the KVM console is the only thing we see, it would be an option.

From an implementation point of view it's no problem to adapt the files.

from azure-li-services.

schaefi commented on September 22, 2024

All of these information channels only work if no graphics session is displayed. I'm just saying that to make it clear that any X server will grab the terminal and no console information will stay there. Our image as we deliver it does not start any graphics, thus I know we can display the data and there is a chance that users read it.

If you are considering basing decisions on the data read from the console, you should be aware that the concept will break once a graphics session is started at boot time.

from azure-li-services.

jeffaco commented on September 22, 2024

But, even if graphics starts at boot time, can you escape out to a text console and then see /etc/issue, even if the system has fully booted? If so, this sounds great to me. Now, of course, we couldn't easily log in to see what happened - the best we could do is redeploy with an account that we know. But I imagine that once things are set up and stable, we won't have intermittent failures.

I'm working on testing the image now. I'll try several boots to experiment with this (booting is very slow, though, due to POST with several TB of memory).

from azure-li-services.

jeffaco commented on September 22, 2024

I just did a test boot on a blade with the latest (.28) image, here's the boot screen. During boot, I had a graphical boot showing progress, then once finished, without hitting any keys I ended up with boot screen.

I did end up modifying the /etc/issue file and then logged out and logged in, and I noted that I did see the modified /etc/issue file. This seems like it's a reasonable way to show us errors without a lot of gruff. We can just attach to the KVM, not even log in, and see if there's a problem or not. We couldn't diagnose (without working with customer), but I don't expect a lot of problems here.

Does this work for SUSE?

from azure-li-services.

schaefi commented on September 22, 2024

It would be a start. I'm good with that.

Ok so on an error condition we will write an appropriate /etc/issue file. What should be its contents ?
As this will be pointing to the Microsoft operations team I leave it up to you to make a suggestion that fits best for you.

from azure-li-services.

jeffaco commented on September 22, 2024

The line currently says something like:

Welcome to SUSE Linux Enterprise Server for SAP Applications 12 SP3 (x86_64) - Kernel ...

I'm not terribly picky as long as it's something obvious. Maybe:

DEPLOYMENT ERROR - Welcome to SUSE Linux Enterprise Server for SAP Applications 12 SP3 (x86_64) - Kernel ...

Thoughts? As far as Ops is concerned, I'll just tell them what to watch out for. I don't think they'll be terribly picky either.
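
To illustrate the proposal, here is a small sketch that prefixes the first non-empty line of the /etc/issue text with the suggested marker. The function name `mark_issue_failed` is hypothetical, the final wording was still open at this point in the discussion, and the sketch deliberately works on a string rather than the file itself, since writing /etc/issue requires root.

```python
ERROR_PREFIX = 'DEPLOYMENT ERROR - '


def mark_issue_failed(issue_text):
    """Return issue_text with the error marker on its first non-empty line.

    Calling it twice does not add the marker twice.
    """
    lines = issue_text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.strip():
            if not line.startswith(ERROR_PREFIX):
                lines[index] = ERROR_PREFIX + line
            break
    else:
        # No visible line at all: add a standalone marker line
        lines.append(ERROR_PREFIX.rstrip(' -') + '\n')
    return ''.join(lines)


original = 'Welcome to SUSE Linux Enterprise Server - Kernel \\r (\\l).\n'
print(mark_issue_failed(original), end='')
```

Because getty shows /etc/issue before login, a marker placed this way would be visible on the KVM console without any account.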

from azure-li-services.

schaefi commented on September 22, 2024

Ok, I'll come up with a pull request to address this, and we can decide on the final message at code review.
From a discussion point of view I think we are all clear.

from azure-li-services.
