Comments (31)

patkinson01 avatar patkinson01 commented on May 24, 2024 1

Thanks @gthao313!

Few other pieces of information which may be relevant here:

The VPC where the cluster is running has a non-standard DHCP option set - we use our own nameservers and not AmazonProvidedDNS. domain-name is set to: eu-west-1.compute.internal

We reduce IMDSv2 hop limit to 1 (I've just tried increasing to 2 but didn't fix it)

We also set https-proxy and no-proxy in the userdata
no-proxy = [ "192.168.0.0/16", "10.0.0.0/8", "100.64.0.0/10", "localhost", "127.0.0.1", "169.254.169.254", ".compute.internal", ".cluster.local.", ".cluster.local", ".svc", ".eks.amazonaws.com", ".s3.eu-west-1.amazonaws.com", ".s3.dualstack.eu-west-1.amazonaws.com" ]

from bottlerocket.

yeazelm avatar yeazelm commented on May 24, 2024 1

Hi @yeazelm , thanks for the additional info and pointer! Can I just confirm that your suggestion to add .ec2.amazonaws.com to the no-proxy value is based on an assumption we have implemented an EC2 interface endpoint service within the VPC? Thanks :)

Correct. You'll need an interface VPC endpoint for EC2.

from bottlerocket.

bcressey avatar bcressey commented on May 24, 2024 1

It'll let you avoid having to specify an arbitrary hostname in hostname-override if you're using the aws cloud provider.

@etungsten - after #3582, there's still a need to specify an arbitrary hostname, right? It just won't actually be rendered in the final config and won't affect kubelet behavior.

from bottlerocket.

atkins4aviva avatar atkins4aviva commented on May 24, 2024 1

I am pleased to report that the new Bottlerocket AMI is working as expected without the bootstrap workaround.

Thank you to all at AWS and Bottlerocket who have helped us resolve this issue. It's taken a while but we finally got there!

Thanks again,

  • Steve

from bottlerocket.

gthao313 avatar gthao313 commented on May 24, 2024

@sonalita Thanks for opening the ticket. We will investigate and try to reproduce it.

from bottlerocket.

gthao313 avatar gthao313 commented on May 24, 2024

@patkinson01 @sonalita Sorry for the late reply. I tried to test this and was unable to reproduce it. My approach was to create a 1.25 EKS cluster with a nodegroup containing some 1.25 nodes, then update the EKS cluster and nodegroup to 1.26. Both upgrades went well in my test. To narrow down the issue, I need your help with a test and a question.

Can you launch a few new 1.26 nodes in your EKS cluster to validate whether they are able to join the cluster?

Were you aware of any network setup changes during the upgrade?

Thanks!

from bottlerocket.

patkinson01 avatar patkinson01 commented on May 24, 2024

Hi @gthao313 - the first scenario worked for us too, the 1.25 upgrade to 1.26 step worked. The issue comes when we destroy the nodegroup and try to create a new nodegroup with new 1.26 nodes.

Nothing else should have changed in the config - everything is 'as code' and pushed out via Terraform.

from bottlerocket.

yeazelm avatar yeazelm commented on May 24, 2024

We also set https-proxy and no-proxy in the userdata
no-proxy = [ "192.168.0.0/16", "10.0.0.0/8", "100.64.0.0/10", "localhost", "127.0.0.1", "169.254.169.254", ".compute.internal", ".cluster.local.", ".cluster.local", ".svc", ".eks.amazonaws.com", ".s3.eu-west-1.amazonaws.com", ".s3.dualstack.eu-west-1.amazonaws.com" ]

I wanted to follow up: I noticed your no-proxy might need an additional value, .ec2.amazonaws.com, which is what the node calls to determine the name it uses to connect to the cluster. This change added the need to get this private name via our internal code in pluto, as you see in the logs. Your node needs to be able to call EC2 to confirm its private name for joining the cluster. Can you try adding that (and ensuring the IAM policies you are adding to the node have this access too) and see if that resolves this issue?

from bottlerocket.

patkinson01 avatar patkinson01 commented on May 24, 2024

Hi @yeazelm , thanks for the additional info and pointer! Can I just confirm that your suggestion to add .ec2.amazonaws.com to the no-proxy value is based on an assumption we have implemented an EC2 interface endpoint service within the VPC? Thanks :)

from bottlerocket.

yeazelm avatar yeazelm commented on May 24, 2024

It sounds like this might have resolved your issue, can you confirm if the endpoint and permissions solved your issue?

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

@yeazelm No, as Paul said above, our environment is quite tightly locked down so we are not able to add the endpoint and no-proxy change - we are able to reach out through our normal proxy. FYI, we are also trying to find a solution via AWS support, who had us configure the admin container. We were able to do so on a 1.25 node and use sheltie to get at logs but unfortunately, with a 1.26 node, the instances are not even getting as far as deploying the Bottlerocket admin container.

from bottlerocket.

yeazelm avatar yeazelm commented on May 24, 2024

For the external AWS cloud provider (which was added in 1.26, and the in-tree provider was removed in 1.27), here is a workaround to try which might let the node come up. This could possibly work for 1.26 but will not work for 1.27 and later (due to the removal of the in-tree provider), and you will need the EC2 endpoint available to your nodes going forward on 1.27 and later. Nonetheless, you might try this as a workaround until you figure out a path to getting the EC2 endpoint sorted out.

set settings.kubernetes.cloud-provider to aws
set settings.kubernetes.hostname-override to an empty string to skip pluto timing out on the EC2 request

This essentially will revert to the old behavior on 1.25.

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Hi @yeazelm
After adding those two settings, we now see a different error in the ec2 instance system log:

[  OK  ] Finished wicked managed network interfaces.
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
         Starting Bottlerocket userdata configuration system...
[    3.554275] early-boot-config[1325]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 400 when PATCHing /settings?tx=bottlerocket-launch: Json deserialize error: Unable to deserialize into ValidLinuxHostname: Invalid hostname '': must only be [0-9a-z.-], and 1-253 chars long at line 1 column 1745
[FAILED] Failed to start Bottlerocket userdata configuration system.
See 'systemctl status early-boot-config.service' for details.

Our (redacted) userdata looks like this:

settings.kubernetes.cluster-name = 'xxx'
settings.kubernetes.api-server = 'https://xxx.gr7.eu-west-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'xxx'
settings.kubernetes.cluster-dns-ip = '192.168.0.10'
settings.kubernetes.max-pods = 110
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup-image' = 'ami-03b30c03b4fd62ad5'
settings.kubernetes.node-labels.'eks.amazonaws.com/capacityType' = 'ON_DEMAND'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateVersion' = '1'
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'managed-ondemand-20231101134231957200000004'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateId' = 'lt-0ae7efeb7273a132c'
settings.kubernetes.node-labels.'bottlerocket.aws/updater-interface-version' = '2.0.0'
settings.kubernetes.cloud-provider = 'aws'
settings.kubernetes.hostname-override = ''
settings.network.no-proxy = ['192.168.0.0/16', '10.0.0.0/8', '100.64.0.0/10', 'localhost', '127.0.0.1', '169.254.169.254', , '.compute.internal', ', '.cluster.local.', '.cluster.local', '.svc', '.eks.amazonaws.com', '.s3.eu-west-1.amazonaws.com', '.s3.dualstack.eu-west-1.amazonaws.com', '.vpce.amazonaws.com']
settings.network.https-proxy = 'xxx'
settings.container-registry.credentials = [{registry = 'xxx', username = 'xxx', password = 'xxxx'}]
settings.host-containers.admin.enabled = true
settings.host-containers.admin.user-data = 'xxx'
settings.kernel.sysctl.'user.max_user_namespaces' = '0'
settings.kernel.sysctl.'vm.max_map_count' = '262144'
settings.kernel.sysctl.'net.ipv4.conf.all.send_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.default.send_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.all.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.default.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv6.conf.all.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv6.conf.default.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.all.secure_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.default.secure_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.all.log_martians' = '1'
settings.kernel.sysctl.'net.ipv4.conf.default.log_martians' = '1'
settings.bootstrap-containers.bottle.source = 'xxx'
settings.bootstrap-containers.bottle.mode = 'once'
settings.bootstrap-containers.bottle.user-data = 'xxxx'
settings.updates.ignore-waves = true
settings.updates.seed = 0

The bootstrap userdata does not contain any hostname information

from bottlerocket.

yeazelm avatar yeazelm commented on May 24, 2024

Ok, I had not tried setting the hostname to '' before asking you to try it. Sadly it makes sense that early-boot-config is not able to handle an empty string and fall back to the default. I'm sorry that didn't work. I'll do a bit more digging to see if I can find a way to get this workaround to work.
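
For anyone else hitting this, the rule quoted in the error message ("must only be [0-9a-z.-], and 1-253 chars long") can be checked before launching a node. The following is a minimal sketch in POSIX shell; the function name is hypothetical and it only mirrors the rule as worded in the log, not Bottlerocket's actual validation code, which may be stricter.

```shell
# Hypothetical helper: mirrors the rule quoted in the early-boot-config error
# ("must only be [0-9a-z.-], and 1-253 chars long"). This is not Bottlerocket's
# own validation code.
is_valid_hostname_override() {
    name=$1
    # Reject empty strings; this is the case that failed in the log above.
    [ -n "$name" ] || return 1
    # Reject anything longer than 253 characters.
    [ "${#name}" -le 253 ] || return 1
    # Reject any character outside 0-9, a-z, '.', and '-'.
    case $name in
        *[!0-9a-z.-]*) return 1 ;;
    esac
    return 0
}
```

With this check, an empty value is rejected while any short placeholder such as x is accepted, which matches the behavior seen in the boot log.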

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

Hi @sonalita, you can set settings.kubernetes.hostname-override to any arbitrary non-empty hostname string to workaround the issue. kubelet will ignore the --hostname-override option if the AWS in-tree cloud provider is responsible for setting the node name. See kubernetes/kubernetes#64659.

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Hi @etungsten That workaround was successful. We now have 1.26 nodes!
Thank you for your help! We do need a solution that will work for 1.27 and beyond.
I am going to pursue getting the EC2 interface endpoint service configured with our cloudops team but that may take a few days.

from bottlerocket.

jooh-lee avatar jooh-lee commented on May 24, 2024

Hi @etungsten and @yeazelm, this seems to be a problem in 1.27 as well. I have an EKS cluster on 1.27 and, same deal, had to recreate the nodegroup. On v1.14 I had no issues with bringing up a node on 1.27; with 1.16.0 the nodes do not come up at all. The instances do have a connection to *.ec2.amazonaws.com, and I've tried setting up the admin container, but it's not coming up at all. Is there a workaround for 1.27?

This is not a problem with k8s 1.28 and 1.16.0 of the bottlerocket ami.

The arch we're using is x86

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Hi @yeazelm Unfortunately, we're still having issues with the workaround - if I kubectl describe a node, I do see that the hostname is set to "x" - the value I set in the toml file - and this seems to be causing some issues with some addon pods not starting properly and also the kubectl log command on any pod fails with a TLS error. We have confirmed that .ec2.amazonaws.com is reachable via our squid proxy so unless the bottlerocket boot process is not honouring the proxy settings, I'm being told that we should not need to add the VPC endpoint. I have tried with kubectl versions 1.25, 1.26 and 1.27 - so unfortunately although the nodes are joining the cluster, it is not stable and therefore unusable in a production environment.

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

Hi @sonalita, I'm currently wrapping up #3582 to help with the behavior you're seeing. It'll let you avoid having to specify an arbitrary hostname in hostname-override if you're using the aws cloud provider. Edit: you would still need to specify the arbitrary hostname, but it won't be passed to kubelet, which avoids the undesired behavior. The in-tree AWS cloud provider would manage the node name matching during registration. Once it merges, we'll be releasing the change in our upcoming 1.16.1 release, which should be happening early next week.

One more thing,

We have confirmed that .ec2.amazonaws.com is reachable via our squid proxy

The correct endpoint for the EC2 API should be of the form ec2.<aws-region>.amazonaws.com. So if you're trying to no proxy the EC2 endpoint, that should be the entry to put.
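
To make the difference concrete, here is a small POSIX shell sketch. Both function names are hypothetical, and the second one assumes plain suffix matching for no-proxy entries, which is how many clients treat leading-dot entries; the exact matching used by pluto is not shown in this thread.

```shell
# Build the regional EC2 API hostname in the form ec2.<aws-region>.amazonaws.com.
ec2_endpoint_host() {
    printf 'ec2.%s.amazonaws.com\n' "$1"
}

# Hypothetical check: does a hostname end with a given no-proxy entry?
# Assumes simple suffix matching; real clients may behave differently.
host_matches_suffix() {
    host=$1
    suffix=$2
    case $host in
        *"$suffix") return 0 ;;
    esac
    return 1
}
```

Under that assumption, host_matches_suffix "$(ec2_endpoint_host eu-west-1)" .ec2.amazonaws.com fails, because the region sits between ec2 and amazonaws.com, while the full regional hostname matches itself.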

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

Hi @jooh-lee,

I believe what you're seeing is a different issue. There should be no difference between v1.14.0 and v1.16.0 when running in a K8s 1.27 cluster. If your admin container does not come up, that would point to a network issue. Can you please create a separate issue with details about your cluster environment and any relevant host configuration for us to track?

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

Ah right, that's correct. You would still need to specify an arbitrary value to skip pluto settings generation. I've edited my original comment.

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Hi

We have some good news! The PR merge has resulted in a healthy 1.26 node group with all pods running and kubectl logs command is working!

Now we need to focus on a less tactical solution for 1.27 onwards.

To answer the question about the proxy config:

In our proxy config we allow *.amazonaws.com through the proxy with just a few entries in our NO_PROXY where we want to route directly namely:

• .eks.amazonaws.com (For internal EKS Control Plane API Endpoints)
• .s3.eu-west-1.amazonaws.com & .s3.dualstack.eu-west-1.amazonaws.com (for S3)
• .vpce.amazonaws.com (for PrivateLink endpoints)

Is the code definitely respecting system proxy and no-proxy settings?

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Hi team, sorry for the delay in replying. We were asked by AWS support to sheltie onto a node and run a describe-instances command. The aws command isn't on the path, and find / -name aws listed many results. The one in /var/lib/provisioning/v2/2.11.4/bin/aws seems to work, but when I run /var/lib/provisioning/v2/2.11.4/bin/aws --region eu-west-1 ec2 describe-instances --debug it just hangs.

I'm attaching the debug output (with tokens redacted) for your perusal.

describe-instances.txt

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Hi all

I realised that when I did my tests on Thursday, I may not have set the proxy variables correctly after issuing the sheltie command. I have repeated my tests this morning and can confirm that with the environment variables https_proxy, http_proxy and no_proxy set correctly, we can indeed successfully execute an aws ec2 describe-instances command.

The debug output is attached with security tokens etc. redacted as well as truncating most of the XML returned. The command I used was
aws --region eu-west-1 ec2 describe-instances --filters Name=image-id,Values=ami-004a21828789c1a10 --debug

For information, I’ve confirmed that the launch template for the instances has these settings, the values of which match what I set for the https_proxy and no_proxy env vars in my test.

settings.network.no-proxy = [ ]
settings.network.https-proxy = '<our proxy url/credentials> '

Are these definitely set in the environment at the time you run the describe-instances command?

Hopefully this information will help you to debug the issue further.
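
For reference, the session setup described above can be sketched as a small POSIX shell helper. The function name and the example values in the usage note are hypothetical; the environment variable names are the ones mentioned in this comment, and no_proxy takes a comma-separated list rather than the TOML array used in userdata.

```shell
# Hypothetical helper: export the proxy variables for the current shell
# session so that tools started afterwards (such as the aws CLI) inherit them.
use_proxy_env() {
    proxy_url=$1      # e.g. the value of settings.network.https-proxy
    no_proxy_list=$2  # comma-separated form of settings.network.no-proxy
    export https_proxy="$proxy_url"
    export http_proxy="$proxy_url"
    export no_proxy="$no_proxy_list"
}
```

A call such as use_proxy_env 'http://proxy.example.internal:3128' '169.254.169.254,.eks.amazonaws.com' would then be followed by the describe-instances command in the same session.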

from bottlerocket.

yeazelm avatar yeazelm commented on May 24, 2024

Hey @sonalita, thanks for the updates! It sounds like there might be some nuance in how you are configuring your proxy. What was the change you needed to do to get it working? I wonder if we have additional work to do to handle what that change is?

Are these definitely set in the environment at the time you run the describe-instances command?

We pass them directly to pluto and have confirmed pluto respects these settings. pluto is what ends up calling the EC2 DescribeInstances API via the Rust SDK. It might be a matter of formatting these variables to ensure they work correctly.

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

Copying the info I added to the AWS support case here for reference (I was asked for the console logs and userdata again)

Hi, as requested, I'm attaching the console log for an instance attempting to join a Kubernetes 1.27 nodegroup and the sanitized userdata from its launch template.
As you can see on line 178, we are still seeing the Pluto error. You can see we are setting proxy in the userdata, and have previously confirmed that on a running node, we can sheltie into the node via the admin container and successfully run a describe-instances command after setting the https_proxy and no_proxy environment variables on that session to match the userdata configuration.

br-1.16.1-k8s-1-27-userdata.txt
br-1.16.1-k8s-1-27-bootlog.txt

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

Hi @sonalita, thanks for the additional info.

In your previous comment you mentioned your https_proxy configuration contains proxy credentials. I'm assuming it's in the format of
https_proxy="http://username:password@<proxy-host>:80"

Currently pluto's proxy handling does not handle proxy credentials.

    let mut proxy_uri = https_proxy.parse::<Uri>().context(UriParseSnafu {
        input: &https_proxy,
    })?;

I think that's the issue you're running into right now. Unfortunately I don't have a quick workaround for this at the moment. pluto needs to be taught how to extract and use proxy creds.

I'm gonna go ahead and update the title of this issue so we can track this more accurately.
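
To illustrate what "extracting proxy creds" involves, here is a minimal POSIX shell sketch that splits a proxy URL of the form scheme://user:pass@host:port into its parts. This is not pluto's code (pluto is written in Rust); the function and variable names are hypothetical, and the sketch assumes the URL includes a scheme.

```shell
# Hypothetical sketch: split scheme://user:pass@host:port into credentials
# and a credential-free proxy URL. Assumes the URL contains "://".
split_proxy_url() {
    url=$1
    scheme=${url%%://*}
    rest=${url#*://}
    case $rest in
        *@*)
            userinfo=${rest%@*}    # everything before the last '@'
            hostport=${rest##*@}   # everything after the last '@'
            ;;
        *)
            userinfo=
            hostport=$rest
            ;;
    esac
    # The user name is the part of userinfo before the first ':'.
    proxy_user=${userinfo%%:*}
    case $userinfo in
        *:*) proxy_pass=${userinfo#*:} ;;
        *)   proxy_pass= ;;
    esac
    # Proxy address with the credentials removed.
    proxy_base="$scheme://$hostport"
}
```

After the split, a client would typically connect to proxy_base and present the user name and password separately, for example in a Proxy-Authorization header.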

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

One workaround that comes to mind is to use bootstrap containers to basically replace what pluto is trying to do.

Firstly, you need to skip pluto execution during boot by setting settings.kubernetes.hostname-override to a random non-empty string value in your userdata.
In the bootstrap container, you can call aws-cli to fetch the private DNS name for the instance through describe-instances and set settings.kubernetes.hostname-override to that value via apiclient. kubelet should then work as expected.

from bottlerocket.

sonalita avatar sonalita commented on May 24, 2024

@etungsten Hi
SUCCESS!!!!!!!!!!! For documentation and to help others facing the same issue, here's what I did.

I added the hostname-override variable to the bottlerocket userdata and then I added the following to our bootstrap container run script:


    # Fetch an IMDSv2 session token.
    TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

    echo "PRI-DNS: Got token $TOKEN"

    # Look up this instance's ID from instance metadata.
    INSTANCE_ID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)

    echo "PRI-DNS: Got instance ID: $INSTANCE_ID"

    # Ask EC2 for the instance's private DNS name (assumes REGION is already set).
    PRIVATE_DNS=$(aws ec2 --region "$REGION" describe-instances --instance-ids "$INSTANCE_ID" \
        --query 'Reservations[*].Instances[*].{Instance:PrivateDnsName}' --output text)

    echo "PRI-DNS: Got PRIVATE_DNS: $PRIVATE_DNS"

    # Replace the placeholder hostname-override with the real private DNS name.
    apiclient set settings.kubernetes.hostname-override="$PRIVATE_DNS"

Any idea on when the pluto code will be updated and released?

from bottlerocket.

etungsten avatar etungsten commented on May 24, 2024

Hi @sonalita,

We merged a potential fix for this issue and will be releasing it in the next Bottlerocket release next week.

from bottlerocket.
