
acs's Introduction

Microsoft Azure Container Service

Overview

This repository will serve as a home for tracking known issues regarding the Azure Container Service.

This will also host a document discussing the state of Kubernetes on Azure.

Please visit the ACS-Engine repository for issues or questions regarding the use of the open-source core of ACS.

Announcements

Code of conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

acs's People

Contributors

anhowe, colemickens, jackfrancis, jackquincy, jchauncey, jcorioland, jiangtianli, olblak, rgardler-msft, sauryadas, seanknox, squillace, weinong


acs's Issues

The value 'Standard_NC6' of parameter 'agentProfile.vmSize' is not allowed

I'm trying to use the Standard_NC6 VM type for my Kubernetes cluster using ACS. I've verified that my subscription has these VM types available. This is for the eastus region. However, I'm getting the aforementioned error when trying to bring up a Kubernetes cluster in ACS. Am I doing something wrong?

acs stops working

I've had this happen once before, but that time there were no errors, just an unresponsive public endpoint and container creation failures. Both times I tried my best to find the reason for the failure, but ended up having to recreate the entire cluster.

This time I'm keeping the failed cluster running for debugging, but because of the cost of running two clusters I can't do that for long.

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

Kubernetes

What happened:

Cluster public IP stops responding, and container creation fails. This is the only error message I can see.

MountVolume.SetUp failed for volume "kubernetes.io/secret/3036090c-6712-11e7-b20f-000d3a29b8df-default-token-2w2qk" (spec.Name: "default-token-2w2qk") pod "3036090c-6712-11e7-b20f-000d3a29b8df" (UID: "3036090c-6712-11e7-b20f-000d3a29b8df") with: mount failed: fork/exec /bin/mount: resource temporarily unavailable Mounting command: mount Mounting arguments: tmpfs /var/lib/kubelet/pods/3036090c-6712-11e7-b20f-000d3a29b8df/volumes/kubernetes.io~secret/default-token-2w2qk tmpfs [] Output:
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "brfportal"/"brfportal-auth-3849485180-4phdk". list of unattached/unmounted volumes=[default-token-2w2qk]

How can I upgrade tiller on ACS?

Is this a BUG REPORT or FEATURE REQUEST? (choose one): FEATURE REQUEST

Every time I try to bump the tiller version up to v2.6.0 in ACS, ACS keeps preventing that by re-deploying tiller v2.5.1, which is what was provided when bootstrapping the cluster. I'd like to upgrade tiller on my cluster, but I cannot do that without switching over to acs-engine to deploy a cluster.

Loadbalancer does not redirect traffic to k8s master

Is this a BUG REPORT or FEATURE REQUEST?:
I am not sure yet if it's a bug or a feature

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
I deployed a cluster with one master and one agent.
I then deployed a service with a public IP (Ingress).
I flagged the agent as in 'maintenance mode' and migrated all pods onto the master.

The load balancer continued to send traffic to the agent (even though the agent was stopped) and didn't send traffic to the master.

What you expected to happen:
I expect the loadbalancer to redirect traffic to the master as well.

How to reproduce it (as minimally and precisely as possible):

  • Deploy a service with a public IP
  • kubectl cordon k8s-agent-********-0
  • kubectl uncordon k8s-master-********-0
    Loadbalancer should send traffic to k8s-master

Anything else we need to know:

I also noticed that, by default, new masters have scheduling disabled.
Is it a good practice to avoid running pods on the master?

az acs kubernetes get-credentials fails with error "No authentication methods available"

Problem description

For two months or so, we have had scripts up and running which provision Kubernetes clusters on demand for testing purposes, and until today we have always managed to get the kubeconfig out of the cluster via the following command line:

az acs kubernetes get-credentials \
    --name ${K8S_NAME} \
    --resource-group ${RESOURCE_GROUP} \
    --ssh-key-file k8sadmin.id_rsa \
    --file ./kubeconfig

As of today (noon or so), this fails with this error message:

ERROR: No authentication methods available
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/azure/cli/main.py", line 36, in main
    cmd_result = APPLICATION.execute(args)
  File "/usr/lib/python2.7/site-packages/azure/cli/core/application.py", line 201, in execute
    result = expanded_arg.func(params)
  File "/usr/lib/python2.7/site-packages/azure/cli/core/commands/__init__.py", line 417, in _execute_command
    reraise(*sys.exc_info())
  File "/usr/lib/python2.7/site-packages/azure/cli/core/commands/__init__.py", line 399, in _execute_command
    result = op(client, **kwargs) if client else op(**kwargs)
  File "/usr/lib/python2.7/site-packages/azure/cli/command_modules/acs/custom.py", line 690, in k8s_get_credentials
    _k8s_get_credentials_internal(name, acs_info, path, ssh_key_file)
  File "/usr/lib/python2.7/site-packages/azure/cli/command_modules/acs/custom.py", line 711, in _k8s_get_credentials_internal
    '.kube/config', path_candidate, key_filename=ssh_key_file)
  File "/usr/lib/python2.7/site-packages/azure/cli/command_modules/acs/acs_client.py", line 48, in SecureCopy
    ssh.connect(host, username=user, pkey=pkey)
  File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 381, in connect
    look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
  File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 623, in _auth
    raise SSHException('No authentication methods available')
SSHException: No authentication methods available

Side note: This is from a Linux client running inside docker; running the same directly from macOS renders a slightly different error message:

Authentication failed.
Traceback (most recent call last):
( ... identical to above ... )
AuthenticationException: Authentication failed.

When adding --verbose --debug, this is what we get:

DEBUG: paramiko.transport : Local version/idstring: SSH-2.0-paramiko_2.1.2
DEBUG: paramiko.transport : Remote version/idstring: SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu1
INFO: paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
DEBUG: paramiko.transport : kex algos:[u'[email protected]', u'ecdh-sha2-nistp256', u'ecdh-sha2-nistp384', u'ecdh-sha2-nistp521', u'diffie-hellman-group-exchange-sha256', u'diffie-hellman-group14-sha1'] server key:[u'ssh-rsa', u'rsa-sha2-512', u'rsa-sha2-256', u'ecdsa-sha2-nistp256', u'ssh-ed25519'] client encrypt:[u'[email protected]', u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'[email protected]', u'[email protected]'] server encrypt:[u'[email protected]', u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'[email protected]', u'[email protected]'] client mac:[u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-sha1'] server mac:[u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'[email protected]', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-sha1'] client compress:[u'none', u'[email protected]'] server compress:[u'none', u'[email protected]'] client lang:[u''] server lang:[u''] kex follows?False
DEBUG: paramiko.transport : Kex agreed: diffie-hellman-group14-sha1
DEBUG: paramiko.transport : Cipher agreed: aes128-ctr
DEBUG: paramiko.transport : MAC agreed: hmac-sha2-256
DEBUG: paramiko.transport : Compression agreed: none
DEBUG: paramiko.transport : kex engine KexGroup14 specified hash_algo <built-in function openssl_sha1>
DEBUG: paramiko.transport : Switch to new keys ...
DEBUG: paramiko.transport : Adding ssh-rsa host key for dev-dev1705041406-f3d364.northeurope.cloudapp.azure.com: b945a4bc4c30a3c303138d80c015d36a
ERROR: No authentication methods available
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/azure/cli/main.py", line 36, in main
    cmd_result = APPLICATION.execute(args)
  File "/usr/lib/python2.7/site-packages/azure/cli/core/application.py", line 201, in execute
    result = expanded_arg.func(params)
  File "/usr/lib/python2.7/site-packages/azure/cli/core/commands/__init__.py", line 417, in _execute_command
    reraise(*sys.exc_info())
  File "/usr/lib/python2.7/site-packages/azure/cli/core/commands/__init__.py", line 399, in _execute_command
    result = op(client, **kwargs) if client else op(**kwargs)
  File "/usr/lib/python2.7/site-packages/azure/cli/command_modules/acs/custom.py", line 690, in k8s_get_credentials
    _k8s_get_credentials_internal(name, acs_info, path, ssh_key_file)
  File "/usr/lib/python2.7/site-packages/azure/cli/command_modules/acs/custom.py", line 711, in _k8s_get_credentials_internal
    '.kube/config', path_candidate, key_filename=ssh_key_file)
  File "/usr/lib/python2.7/site-packages/azure/cli/command_modules/acs/acs_client.py", line 48, in SecureCopy
    ssh.connect(host, username=user, pkey=pkey)
  File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 381, in connect
    look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
  File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 623, in _auth
    raise SSHException('No authentication methods available')
SSHException: No authentication methods available
DEBUG: paramiko.transport : EOF in transport thread

Workaround

We found out we could still log in to the master VM using the k8sadmin.id_rsa key, so it does not seem that we botched our key pair.

This enables the following workaround, but if there's an az command for it, we figured it'd be better to use that (to be sure we don't run into issues if you decide to put the .kube/config someplace else than on the master VM):

echo "INFO: Trying to scp credentials from master VM."
masterFqdn=$(az acs show --resource-group ${RESOURCE_GROUP} --name ${K8S_NAME} --query 'masterProfile.fqdn' -otsv)
echo "INFO: Master VM FQDN: ${masterFqdn}"
scp -oStrictHostKeyChecking=no -i k8sadmin.id_rsa k8sadmin@${masterFqdn}:/home/k8sadmin/.kube/config ./kubeconfig

Has something changed? Have we missed something, or is this a side effect from someplace else?

Preserving Source IP addresses

I have exposed an nginx container using:

kubectl run my-nginx --image=nginx --port 80
kubectl expose deployment my-nginx --port=80 --type=LoadBalancer --session-affinity=ClientIP

But when I look at the Source IP addresses in the nginx logs I still see the IP address of the cluster.

The docs here state that Source IP preservation on L4 load balancers should be implemented but it's not clear how. As far as I can see my traffic is still being SNAT'd on the kubernetes node.

Is this the desired behaviour?

In order to have the correct source IPs do I have to wait until the Ingress Controller is working via Application Gateways and then examine some header for the original source?

Tx
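For reference, a minimal sketch of the usual fix, assuming a cluster on Kubernetes 1.7 or later (on 1.7+ the externalTrafficPolicy field replaces the older service.beta.kubernetes.io/external-traffic: OnlyLocal annotation):

echo "
kind: Service
apiVersion: v1
metadata:
  name: my-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # skip the extra SNAT hop so pods see the client IP
  selector:
    run: my-nginx
  ports:
    - port: 80
" | kubectl apply -f -

Note that with Local, the load balancer health probe only passes on nodes that actually host a pod for the service, so traffic is no longer spread across every node.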

Cannot connect to newly deployed Kubernetes cluster

Problem: after deploying a fresh cluster, it's not possible to connect by following the instructions in the guide at https://docs.microsoft.com/en-us/azure/container-service/container-service-kubernetes-walkthrough

To reproduce:

  • Deploy a new ACS through the Portal
  • az acs kubernetes get-credentials --resource-group=my-rg --name=my-container-service
  • kubectl get pods

Expected result:

  • List pods

Actual result:

  • Unable to connect to the server: dial tcp 40.xx.xx.xx:443: i/o timeout

I have tried deploying a new cluster several times with the same result.

Creating Kubernetes LoadBalancer service leaks static IP

Is this a request for help?:


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
I have some tests that repeatedly create/delete Kubernetes Load Balancer services. After running the tests a few times, all my services are stuck at pending phase with error message:
"Cannot create more than 20 public IP addresses with static allocation method for this subscription in this region."

What you expected to happen:
Since my tests delete all load balancer services in every run, I expect not to hit the 20 public IP address limit.

How to reproduce it (as minimally and precisely as possible):
Just repeatedly create/delete LoadBalancer services.
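As a diagnostic sketch (the resource group name is a placeholder), the leaked addresses can be spotted from the CLI; anything left over after the services are deleted shows up with no attached configuration:

# list every public IP still held by the cluster's resource group
az network public-ip list --resource-group <cluster-resource-group> -o table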

Anything else we need to know:

ACS runs out of disk inodes, but k8s GC is not working

Sometimes I get errors while k8s is pulling new images from the Docker Registry, like:

Failed to pull image "<IMAGE>:latest": failed to register layer: mkdir /var/lib/docker/overlay/545c2788d40c8155df608067746dd90f4e570cc5f406ce968b6e063a1a68917c/tmproot327958565/usr/local/go/test/fixedbugs/bug468.dir: no space left on device

The VM disk does have free space, but no free inodes.

Looks like it is related to moby/moby#10613.

Does Azure support any command for a k8s-type ACS cluster to quickly solve this problem?
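As far as I know there is no single Azure command for this; a hedged manual sweep, run on each affected agent VM, would look something like the following:

# confirm it is inodes, not bytes, that are exhausted
df -i /var/lib/docker

# remove exited containers and dangling image layers to free inodes
docker rm $(docker ps -aq -f status=exited)
docker rmi $(docker images -qf dangling=true)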

K8s custom VNET scaling is broken

Is this a request for help?:
No

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Bug Announcement

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
K8s

K8s custom VNET clusters currently fail to scale up. This is caused by issue Azure/acs-engine#1195 and will be fixed in acs-engine by PR Azure/acs-engine#1194 . Then we will need to take a service deployment to get it fixed. I'll update here with status as we progress.

exposing with sessionAffinity is not respected

I created a new nginx deployment using:
kubectl run my-nginx --image=nginx --port 80

Then expose it using:
kubectl expose deployment my-nginx --port=80 --type=LoadBalancer --session-affinity=ClientIP

But when I look at the options set in the LB in Azure the 'Session persistence' is set to 'None'.

Shouldn't this be 'ClientIP'?

Addon Manager Infinite Termination Loop

Is this a BUG REPORT or FEATURE REQUEST? (choose one): Not sure. I think it's a bug unless I am misunderstanding something.

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes v1.6.6

What happened: When I try to put new addons in the addon directory none of them work. They enter an infinite termination loop. Even attempting to add a simple NGINX deployment fails.

What you expected to happen: I expect the addons to spin up, complete deployment, and have the addon manager reconcile them frequently. When creating new clusters I want to run Ansible to drop files in the addons directory to get storage classes, monitoring, etc. configured. Whenever I try this on minikube it works fine.

How to reproduce it (as minimally and precisely as possible): Put anything in the /etc/kubernetes/addons folder. Add this metadata to the files:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 2
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Example output:

nginx-v98bd   0/1       Pending   0         0s
nginx-vv77t   0/1       Pending   0         0s
nginx-v98bd   0/1       Pending   0         0s
nginx-vv77t   0/1       Pending   0         0s
nginx-v98bd   0/1       ContainerCreating   0         1s
nginx-vv77t   0/1       ContainerCreating   0         1s
nginx-v98bd   0/1       Terminating   0         5s
nginx-vv77t   0/1       Terminating   0         5s
nginx-v98bd   0/1       Terminating   0         5s
nginx-vv77t   0/1       Terminating   0         5s
nginx-v98bd   0/1       Terminating   0         7s
nginx-v98bd   0/1       Terminating   0         7s
nginx-vv77t   0/1       Terminating   0         8s
nginx-vv77t   0/1       Terminating   0         8s
nginx-rqj2t   0/1       Pending   0         1s
nginx-ph6wm   0/1       Pending   0         1s
nginx-rqj2t   0/1       Pending   0         1s
nginx-ph6wm   0/1       Pending   0         1s
nginx-rqj2t   0/1       ContainerCreating   0         1s
nginx-ph6wm   0/1       ContainerCreating   0         1s
nginx-ph6wm   0/1       Terminating   0         5s
nginx-rqj2t   0/1       Terminating   0         5s
nginx-rqj2t   0/1       Terminating   0         5s
nginx-ph6wm   0/1       Terminating   0         5s
nginx-rqj2t   0/1       Terminating   0         7s
nginx-rqj2t   0/1       Terminating   0         8s
nginx-ph6wm   0/1       Terminating   0         8s
nginx-ph6wm   0/1       Terminating   0         8s
nginx-vcwx2   0/1       Pending   0         0s
nginx-frm9f   0/1       Pending   0         0s
nginx-vcwx2   0/1       Pending   0         0s
nginx-frm9f   0/1       Pending   0         0s
nginx-vcwx2   0/1       ContainerCreating   0         0s
nginx-frm9f   0/1       ContainerCreating   0         0s
nginx-frm9f   1/1       Running   0         2s
nginx-vcwx2   1/1       Running   0         3s

no Container Service in azure portal after deploying kubernetes

hello,
I deployed the Kubernetes cluster with an acs-engine-generated script, then deployed it (specifying the cluster name). When I logged into the portal there was no Azure Container Service, only master and agent nodes. When I do az acs show with the specified name, it says that a cluster with that name doesn't exist; therefore I'm unable to use any az acs related commands. Is it a bug, or am I doing something wrong? I double-checked the service principal user and password and they are correct.

Here is the template which I used in acs-engine:
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.6.2",
      "kubernetesConfig": {
        "networkPolicy": "calico"
      }
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "dnsprefix",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "user",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "servicePrincipalClientID": "",
      "servicePrincipalClientSecret": ""
    }
  }
}

Helm cannot be upgraded in ACS

Is this a request for help?:
No.

Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
k8s: Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:33:11Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

What happened:
ACS cluster created and installed with helm 2.5.1 works fine. Draft installed through helm 2.5.1. Helm 2.6.0 comes out along with draft 0.6. CANNOT REMOVE HELM 2.5.1. Want client features, which depend upon server portion, "tiller".

What you expected to happen:
I created the cluster. I called helm init and it installed tiller. I expect to be able to remove tiller and upgrade it. Instead, I CAN upgrade helm to 2.6, which merely creates a new svc for a moment, and then that svc gets drained and the original replaces it, always.

How to reproduce it (as minimally and precisely as possible):

  1. new ACS cluster, eastus.
  2. helm init (for version 2.5.1)
  3. draft init (version 0.6.rc2 or some such).
  4. download and install helm 2.6.0. Helm init --upgrade
  5. returns success. kubectl get po --all-namespaces will show the creation of a new tiller for a moment, only to be replaced by more tiller pods, which were created from 2.5.1.
  6. Try whatever you want: k delete deployment tiller-deploy --namespace kube-system, anything. It comes back within twenty seconds.

Anything else we need to know:
Draft cannot be upgraded if helm is also upgraded, meaning that ACS is a bit crippled for these tools when you create the cluster yourself. I should not have to recreate the cluster just to upgrade helm unless I don't own the cluster.

NOTE
None of this is true for clusters that were created for you; it's entirely possible that I should not be able to remove a tiller version for THOSE clusters, but I expect we don't have that implemented yet.

tiller installed here in acs-engine, if it helps: https://github.com/Azure/acs-engine/blob/master/parts/kubernetesmasteraddons-tiller-deployment.yaml

Quote from Jason: "The add-on manager enforces state from the manifests living on the master(s). Need to dig more to see what version of the manager we are using, may be able to use addonmanager.kubernetes.io/mode=EnsureExists rather than mode=Reconcile to allow users to change the Tiller version."
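If that pans out, the fix would be a one-line label edit in the tiller manifest on the master(s); a sketch, assuming the manifest linked above is the one the add-on manager watches:

metadata:
  name: tiller-deploy
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists   # was Reconcile; EnsureExists leaves user edits alone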

Persistent Volumes in Kubernetes

I've been trying to wrap my head around dynamic persistent storage for kubernetes on azure for a long time now, without success.

I see documentation on kubernetes.io, in this repo, and in blogs, but it all seems vague to me and I never get it to work.

The issue might not belong here, since dynamic provisioning seems to be implemented and working. I'm now running against manually created storage blobs, and that is working. But dynamic provisioning just seems so awesome (and necessary in some cases) I want to know what I'm missing.

Is it as easy as it seems, or are there some requirements not covered by these tutorials? Could I spin up a cluster with az acs create and expect it to work, using a storageClass for example?
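For reference, the smallest setup I would expect to work, as a sketch assuming the built-in kubernetes.io/azure-disk provisioner (names are placeholders; on clusters of this vintage the storage account must sit in the same resource group and region as the cluster):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azure-disk
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Standard_LRS
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: azure-disk
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 5Gi

A pod then consumes the claim through persistentVolumeClaim.claimName: my-claim, as in the azure-disk examples further down this page.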

Broken dashboard in Kubernetes cluster 1.7.7

Is this a request for help?:

yes

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

Client Version: v1.8.0
Server Version: v1.7.7

azure-cli (2.0.18)

What happened:

  • Created new Kubernetes cluster of the api-version=2017-07-01
  • Obtained kube config
  • Started kubectl proxy
  • Tried accessing ui dashboard

RESULT:
Unable to access dashboard at:
http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
Assets are failing to load with 404:

proxy	200	document	Other	1.1 KB	98 ms
vendor.9aa0b786.css	404	stylesheet	proxy	116 B	218 ms
app.8ebf2901.css	404	stylesheet	proxy	116 B	217 ms
vendor.840e639c.js	404	script	proxy	116 B	218 ms
appConfig.json	404	script	proxy	116 B	217 ms
app.68d2caa2.js	404	script	proxy	116 B	217 ms
appConfig.json	404	script	proxy	116 B	71 ms
app.68d2caa2.js	404	script	proxy	116 B	72 ms
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/static/app.8ebf2901.css net::ERR_ABORTED
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/static/vendor.9aa0b786.css net::ERR_ABORTED
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/api/appConfig.json net::ERR_ABORTED
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/static/app.68d2caa2.js net::ERR_ABORTED
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/static/vendor.840e639c.js net::ERR_ABORTED
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/api/appConfig.json net::ERR_ABORTED
GET http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/static/app.68d2caa2.js net::ERR_ABORTED

or (new access method since 1.7):
http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy

where I'm getting:

Error: 'tls: oversized record received with length 20527'
Trying to reach: 'https://10.244.3.2:9090/'

What you expected to happen:
I'm aware that the RBAC permissions model is now enabled, but there seems to be a clusterrolebinding kubernetes-dashboard, so that should open it up.
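A diagnostic sketch for checking that theory from the CLI (binding to cluster-admin is a blunt workaround, only reasonable on a test cluster):

# inspect the binding the dashboard is supposed to have
kubectl get clusterrolebinding kubernetes-dashboard -o yaml

# heavy-handed workaround: bind the dashboard service account to cluster-admin
kubectl create clusterrolebinding kubernetes-dashboard-admin \
    --clusterrole=cluster-admin \
    --serviceaccount=kube-system:kubernetes-dashboard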

How to reproduce it (as minimally and precisely as possible):

export CLUSTERNAME=dev0test0cluster123
az group create --name $CLUSTERNAME --location westus2
az acs create --orchestrator-type kubernetes --resource-group $CLUSTERNAME --name $CLUSTERNAME --master-count 3 --agent-count 3 --ssh-key-value <sensitive> --api-version 2017-07-01

...wait for VMs and system pods to start

az acs kubernetes get-credentials -n $CLUSTERNAME -g $CLUSTERNAME --file ~/.kube/config
kubectl proxy

Now try to access the dashboard via any of the URLs below:
http://localhost:8001/ui

http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy

Anything else we need to know:
The following image is listed on the dashboard deployment:
image: gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3

Also nothing interesting in logs of the dashboard pod:

Using HTTP port: 8443
Using in-cluster config to connect to apiserver
Using service account token for csrf signing
No request provided. Skipping authorization header
Successful initial request to the apiserver, version: v1.7.7
No request provided. Skipping authorization header
Creating in-cluster Heapster client
Successful initial request to heapster

Busy azure-disk regularly fails to mount, causing K8S Pod deployments to halt.

I've setup Azure Container Service with Kubernetes and I use dynamic provisioning of volumes (see details below) when deploying new Pods. Quite frequently (about 10%) I get the following error which halts the deployment:

14h 1m 439 {controller-manager } Warning FailedMount Failed to attach volume "pvc-95aa8dbf-082e-11e7-af1a-000d3a2735d9" on node "k8s-agent-1da8a8df-2" with: Attach volume "clst-west-eu-dev-dynamic-pvc-95aa8dbf-082e-11e7-af1a-000d3a2735d9.vhd" to instance "k8s-agent-1DA8A8DF-2" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure responding to request: StatusCode=409 -- Original Error: autorest/azure: Service returned an error. Status=409 Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'clst-west-eu-dev-dynamic-pvc-f843f8fa-0663-11e7-af1a-000d3a2735d9.vhd' to VM 'k8s-agent-1DA8A8DF-2' because the disk is currently being detached. Please wait until the disk is completely detached and then try again."

The Pod deployment then halts forever, or until I delete the Pod and let the ReplicationController create a new one.

Any idea what is causing this?
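When it happens, the attach/detach history is visible from kubectl; a diagnostic sketch (the pod name is a placeholder):

# show the FailedMount / FailedAttachVolume events for the stuck pod
kubectl describe pod <stuck-pod-name>

# or watch attach/detach activity across the cluster
kubectl get events --all-namespaces | grep -i -e attach -e detach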

Workflow

I have created the following StorageClass:

Name:		azure-disk
IsDefaultClass:	No
Annotations:	<none>
Provisioner:	kubernetes.io/azure-disk
Parameters:	location=westeu,skuName=Standard_LRS,storageAccount=<<storageaccount>>

The storageaccount does contain a Blob service named vhds.

When deploying a new Pod, I create a PVC that looks like this:

{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "Provisioner": "kubernetes.io/azure-disk",
  "metadata": {
    "name": "test-deployment-pvc",
    "annotations": {
      "volume.beta.kubernetes.io/storage-class": "azure-disk"
    },
    "labels": {
      "org": "somelabel"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}

and finally use the PVC in the pods:

{
  "volumes": [
    {
      "persistentVolumeClaim": {
        "claimName": "test-deployment-pvc"
      },
      "name": "storage"
    }
  ]
}

NC Family VMs available on subscription, but not allowed from azure-cli

Howdy folks, our subscription shows NC Family VMs as an option, which I've provisioned via acs-engine:

$ az vm list-usage -l southcentralus --query "[].{Resource:name.localizedValue,Limit:limit,CurrentValue:currentValue}" -o table
Resource                            Limit    CurrentValue
--------------------------------  -------  --------------
Availability Sets                    2000               4
Total Regional Cores                  200              28
Virtual Machines                    10000               8
Virtual Machine Scale Sets           2000
Standard Dv2 Family Cores              10               8
Standard D Family Cores               100               2
Standard NC Family Cores              100              18
...

Creating a Kubernetes cluster via az acs fails however. Using ACS v2.0.3. Is there something else we need to do, account or tool-wise?

$ az acs create --orchestrator-type=kubernetes --resource-group $RESOURCE_GROUP --name=$CLUSTER_NAME --dns-prefix=$DNS_PREFIX --ssh-key-value=$SSH_KEYFILE --agent-vm-size=Standard_NC6 --location=southcentralus
waiting for AAD role to propagate.done
At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details. {
  "error": {
    "code": "InvalidParameter",
    "message": "The value 'Standard_NC6' of parameter 'agentProfile.vmSize' is not allowed. Allowed values are 'Standard_A0, Standard_A1, Standard_A2, Standard_A3, Standard_A4, Standard_A5, Standard_A6, Standard_A7, Standard_A8, Standard_A9, Standard_A10, Standard_A11, Standard_D1, Standard_D2, Standard_D3, Standard_D4, Standard_D11, Standard_D12, Standard_D13, Standard_D14, Standard_D1_v2, Standard_D2_v2, Standard_D3_v2, Standard_D4_v2, Standard_D5_v2, Standard_D11_v2, Standard_D12_v2, Standard_D13_v2, Standard_D14_v2, Standard_G1, Standard_G2, Standard_G3, Standard_G4, Standard_G5, Standard_A1_v2, Standard_A2_v2, Standard_A2m_v2, Standard_A4_v2, Standard_A4m_v2, Standard_A8_v2, Standard_A8m_v2, Standard_D15_v2, Standard_F1, Standard_F16, Standard_F2, Standard_F4, Standard_F8, Standard_H16, Standard_H16m, Standard_H16mr, Standard_H16r, Standard_H8, Standard_H8m, Standard_DS1, Standard_DS2, Standard_DS3, Standard_DS4, Standard_DS11, Standard_DS12, Standard_DS13, Standard_DS14, Standard_GS1, Standard_GS2, Standard_GS3, Standard_GS4, Standard_GS5, Standard_DS1_v2, Standard_DS2_v2, Standard_DS3_v2, Standard_DS4_v2, Standard_DS5_v2, Standard_DS11_v2, Standard_DS12_v2, Standard_DS13_v2, Standard_DS14_v2, Standard_DS15_v2, Standard_F16s, Standard_F1s, Standard_F2s, Standard_F4s, Standard_F8s'."
  }

Kubernetes - Azure Load Balancer provisioning fails - name too long

Reporting of bug

Is this an ISSUE or FEATURE REQUEST? (choose one):

ISSUE:

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

ACS with Kubernetes deployed with Azure CLI 2.0:
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

What happened:

I've provisioned an ACS cluster with Kubernetes. When deploying a Pod with a Service, I noticed that kube-controller-manager on the master cannot provision an Azure Load Balancer for the Service I deployed with type: LoadBalancer.
If I do kubectl get services, I see my service's external endpoint remaining in a pending state.

Having a closer look at the kube-controller-manager logs, it looks like it's hitting the Load Balancer naming limit of 80 characters.

It looks like the concatenation of the resource group, ACS name, container name, etc. results in a name that is too long for the load balancer.

Logs from kube-controller-manager:
Ensuring LB for service default/bookingservice
2017-06-20T23:34:19.947310936Z E0620 23:34:19.946844 1 servicecontroller.go:779] Failed to process service. Retrying in 5m0s: Failed to create load balancer for service default/bookingservice: network.PublicIPAddressesClient#CreateOrUpdate: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="InvalidResourceName" Message="Resource name k8-test-demo-container-services-k8-training-9e1abf-ad8dc07c8558211e7a59f000d3ad0b58 is invalid. The name can be up to 80 characters long. It must begin with a word character, and it must end with a word character or with '_'. The name may contain word characters or '.', '-', '_'." Details=[]

What you expected to happen:

I expect my service to get an external IP by provisioning a load balancer when deploying a kubernetes service with type: LoadBalancer

How to reproduce it (as minimally and precisely as possible):

Deploy ACS with a resource group name of 30 characters containing one or two "-" characters.
The ACS name has 12 characters with "-" characters.
Example:

az acs create -g container-services-k8-training --name k8-test-demo --orchestrator-type kubernetes --generate-ssh-keys --service-principal $principal --client-secret $secret

Try to deploy a service with type: LoadBalancer:

echo "
kind: Service
apiVersion: v1
metadata:
  name: bookingservice
  namespace: default
spec:
  type: LoadBalancer
  selector:
    app: bookingservice
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
" | kubectl apply -f -

Anything else we need to know:

Apologies if this is known behaviour - I was unable to find an existing issue

Thanks,
Marcel

Service down time during kubernetes scale up and down

When scaling an ACS Kubernetes cluster up or down while it has running services that are exposed publicly, contention between ACS and Kubernetes causes downtime for those publicly exposed services; my observation has been about 3 minutes, during which new connections to the service will fail. This is a known issue with a known fix. I will update this issue with an ETA when it is known.

ACS Scale error message is incorrect

Originally reported as:
Azure/azure-cli#3759

Description

Outline the issue here:
Try scaling an ACS cluster above 100 nodes:

$ az acs scale -n $RESOURCE_GROUP --new-agent-count 500 -g $RESOURCE_GROUP
Parameter 'ContainerServiceAgentPoolProfile.count' must be less than 100.

The error says the agent count must be less than 100, but scaling to exactly 100 works:

$ az acs scale -n $RESOURCE_GROUP --new-agent-count 100 -g $RESOURCE_GROUP
 / Running ..

Environment summary

Install Method: How did you install the CLI? (e.g. pip, interactive script, apt-get, Docker, MSI, nightly)
Answer here: $ curl -L https://aka.ms/InstallAzureCli | bash

CLI Version: What version of the CLI and modules are installed? (Use az --version)
Answer here: azure-cli (2.0.8)

OS Version: What OS and version are you using?
Answer here: MacOS Sierra 10.12.4

Shell Type: What shell are you using? (e.g. bash, cmd.exe, Bash on Windows)
Answer here: bash

@seanknox

Error wrt OpenApi

Bug Report

Kubernetes

I deployed K8s onto ACS via an ARM template and am now constantly receiving the error below.
Is this known?

What happened:
W1007 16:05:06.385371 10432 factory_object_mapping.go:423] Failed to download OpenAPI (the server could not find the requested resource), falling back to swagger

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
kubectl apply -f file.yaml

Anything else we need to know:

kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Heapster setup problem on ACS HA

Hi,

I'm trying to set up heapster with influxdb and grafana on an Azure Container Service HA cluster.
One of the master nodes was stopped.

k8s-agent-3871dc9d-0    Ready                         7h        v1.6.6
k8s-agent-3871dc9d-1    Ready                         7h        v1.6.6
k8s-master-3871dc9d-0   NotReady,SchedulingDisabled   7h        v1.6.6
k8s-master-3871dc9d-1   Ready,SchedulingDisabled      7h        v1.6.6
k8s-master-3871dc9d-2   Ready,SchedulingDisabled      7h        v1.6.6

Influxdb is running.
Grafana is also running, but there is no data inside grafana.
I checked the heapster logs and it says:

I0807 16:03:36.461051       1 heapster.go:72] /heapster --source=kubernetes:**** --sink=influxdb:http://monitoring-influxdb.kube-system.svc:8086
I0807 16:03:36.461105       1 heapster.go:73] Heapster version v1.4.0
I0807 16:03:36.461537       1 configs.go:61] Using Kubernetes client with master "https://10.0.0.1:443" and version v1
I0807 16:03:36.461587       1 configs.go:62] Using kubelet port 10255
I0807 16:03:36.603538       1 influxdb.go:278] created influxdb sink with options: host:monitoring-influxdb.kube-system.svc:8086 user:root db:k8s
I0807 16:03:36.603607       1 heapster.go:196] Starting with InfluxDB Sink
I0807 16:03:36.603622       1 heapster.go:196] Starting with Metric Sink
I0807 16:03:36.618332       1 heapster.go:106] Starting heapster on port 8082
E0807 16:04:05.000538       1 kubelet.go:280] Node k8s-master-3871dc9d-0 is not ready
I0807 16:04:05.257727       1 influxdb.go:241] Created database "k8s" on influxDB server at "monitoring-influxdb.kube-system.svc:8086"
E0807 16:05:05.000278       1 kubelet.go:280] Node k8s-master-3871dc9d-0 is not ready
E0807 16:06:05.000285       1 kubelet.go:280] Node k8s-master-3871dc9d-0 is not ready
E0807 16:07:05.000282       1 kubelet.go:280] Node k8s-master-3871dc9d-0 is not ready

I am using the load balancer domain name as the source:

- name: monitoring-heapster
  image: gcr.io/google_containers/heapster-amd64:v1.4.0
  imagePullPolicy: IfNotPresent
  command:
  - /heapster
  - --source=kubernetes:https://[domain].westeurope.cloudapp.azure.com
  - --sink=influxdb:http://monitoring-influxdb.kube-system.svc:8086

Any idea why it does not find the alive nodes in the case of an HA setup?

Br,
R
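One thing worth trying, as a sketch rather than a confirmed fix: point heapster at the in-cluster API server through the summary API (the form the issue below uses) instead of the external load-balancer FQDN, which resolves to whichever master happens to answer:

command:
- /heapster
- --source=kubernetes.summary_api:""
- --sink=influxdb:http://monitoring-influxdb.kube-system.svc:8086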

Heapster (1.3.0) returns message No metrics for pod/unable to get metrics for resource cpu: no metrics returned from heapster

Is this a request for help?:


YES

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes Version: v1.6.6
Heapster Version v1.3.0

What happened:
Heapster fails to collect metrics from pods. Heapster pod log message says No metrics for pod

I0809 06:44:08.573960       1 heapster.go:72] /heapster --source=kubernetes.summary_api:""
I0809 06:44:08.574000       1 heapster.go:73] Heapster version v1.3.0
I0809 06:44:08.574346       1 configs.go:61] Using Kubernetes client with master "https://10.0.0.1:443" and version v1
I0809 06:44:08.574365       1 configs.go:62] Using kubelet port 10255
I0809 06:44:08.575119       1 heapster.go:196] Starting with Metric Sink
I0809 06:44:08.675764       1 heapster.go:106] Starting heapster on port 8082
I0809 06:44:30.404310       1 handlers.go:215] No metrics for pod default/aspnetcoreapp-deployment-1249858457-bx5zl
I0809 06:44:30.404345       1 handlers.go:215] No metrics for pod default/aspnetcoreapp-deployment-1249858457-t6p46
I0809 06:44:30.404349       1 handlers.go:215] No metrics for pod default/aspnetcoreapp-deployment-1249858457-ln4r1

What you expected to happen:
Heapster should collect metrics from pods

How to reproduce it (as minimally and precisely as possible):
Create a new ACS Linux Cluster with Kubernetes as orchestrator.
Create a deployment, service and horizontalpodautoscaler as shown below.

Deployment:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: aspnetcoreapp-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: aspnetcoreapp
    spec:
      containers:
      - name: aspnetcoreapp
        image: maksh/aspnetcorekuberappimg
        ports:
        - containerPort: 5000
        resources:
          requests:
            cpu: 1m

Service

apiVersion: v1
kind: Service
metadata:
  name: aspnetcoreapp-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: aspnetcoreapp

Horizontalpodautoscaler

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: aspnetcoreapp-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: aspnetcoreapp-deployment
  minReplicas: 3
  maxReplicas: 6
  targetCPUUtilizationPercentage: 20

Anything else we need to know:
kubectl describe hpa returns following

Name:                                                   aspnetcoreapp-hpa
Namespace:                                              default
Labels:                                                 <none>
Annotations:                                            <none>
CreationTimestamp:                                      Wed, 09 Aug 2017 12:31:13 +0530
Reference:                                              Deployment/aspnetcoreapp-deployment
Metrics:                                                ( current / target )
  resource cpu on pods  (as a percentage of request):   <unknown> / 20%
Min replicas:                                           3
Max replicas:                                           6
Events:
  FirstSeen     LastSeen        Count   From                            SubObjectPath   Type        Reason
                Message
  ---------     --------        -----   ----                            -------------   --------    ------
                -------
  1m            15s             4       horizontal-pod-autoscaler                       Warning     FailedGetResourceMetric             unable to get metrics for resource cpu: no metrics returned from heapster
  1m            15s             4       horizontal-pod-autoscaler                       Warning     FailedComputeMetricsReplicas        failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from heapster
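A quick way to see whether heapster is returning anything at all; kubectl top reads from the same heapster service the HPA queries, so empty output here points at collection rather than at the autoscaler (a diagnostic sketch, assuming kubectl top is available on this client/server combination):

kubectl top nodes
kubectl top pods --namespace default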

Kubernetes deployed using ACS-Engine and a cluster definition, but the cluster is not running inside the master.

Is this a request for help?:


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
I have deployed the Kubernetes container service using ACS-Engine. After deploying, I am facing the following issues:

  1. After deployment I am not able to connect to the cluster using the SSH key, but I am able to connect using a password.
  2. After connecting to the cluster I am able to log in to master 01, but the docker service is not running and kubectl is not installed on the master server.
  3. I have tried to connect to the cluster using the kubectl CLI from a remote machine but am not able to connect. Can anyone guide me on how to connect to the cluster using the CLI?

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Document Frequently Asked Questions (FAQs)

Candidates:

  • Why does ACS not support Docker Swarm?
  • What version of docker do I get?
  • Upgrades?
  • ACS vs ACS-Engine

We should also clearly cross-link ACS and ACS-Engine and make sure this FAQ is plainly accessible from ACS-Engine, but I guess maybe it should primarily live here?

Kubernetes Deployments are Flakey

This can manifest itself in a couple ways:

  1. The cluster is not healthy and the user is unable to issue kubectl commands, even from the master nodes.

  2. The user is unable to retrieve credentials using the cli (aka az acs get-credentials or az acs kubernetes get-credentials fails)

The issue stems from instability in an external dependency: hkp://ha.pool.sks-keyservers.net:80

We are currently hot fixing to mitigate this issue and building remediation instructions for broken clusters.

Unable to re-deploy when agent pool uses managed disks

Is this a request for help?:
Yes

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Bug

We're using managed disks mainly for the disk encryption that comes for free and requires no key management.

The deployment template has various other things in it, and we tend to redeploy often; we may add a new subscription to an Azure Service Bus topic, for example.

We're not even trying to scale the cluster in any way.

Currently this bug will block the deployment.

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:

Deployment failed. {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "OperationNotAllowed",
        "message": "Scaling a cluster with an agent pool(s) with managed disks is unsupported"
      }
    ]
  }
}

What you expected to happen:
If we're not trying to scale the agent pool (as the error suggests), the deployment should pass.

How to reproduce it (as minimally and precisely as possible):
Deploy a new Azure Container Service cluster with managed disks for your agent pool. Immediately re-deploy, and you'll see the error.

Anything else we need to know:
No.

Static IP on Kubernetes Load Balancer

What happened:
Created a Kubernetes service with the loadBalancerIP field pointing at the IP address:

spec:
  clusterIP: 10.0.123.203
  loadBalancerIP: 52.166.122.228

The svc goes into an error state, saying:

Error creating load balancer (will retry): Failed to create load balancer for service proxy/proxy: user supplied IP Address 52.166.122.228 was not found

Even though this IP address is in the frontend IP configuration of the load balancer, it is not picked up.

The replace command works fine and the external IP address isn't changed. Is there a way to statically bind an IP address in Azure to a Kubernetes service so it never changes?
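As far as I can tell, yes: pre-create a static public IP in the resource group the Kubernetes cloud provider searches (the cluster's own resource group), then set loadBalancerIP to it. A sketch with the az CLI, names being placeholders:

# create a static public IP where the cloud provider can find it
az network public-ip create \
    --resource-group <cluster-resource-group> \
    --name proxy-static-ip \
    --allocation-method Static

The "user supplied IP Address ... was not found" error typically means the address lives in a resource group the cloud provider does not look in.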

[HelpQuestion] about exposing a service on ACS

Is this a request for help?:

When I execute the command to expose a service on ACS via:
kubectl expose deployments nginx --port=80 --type=LoadBalancer

A load balancer automatically gets created, with right backend servers in the server pool etc.

My question is: how do the load balancers get instantiated and configured? Does this happen with kubectl calling into the Azure CNI implementation, which in turn instantiates the load balancer with the necessary configs?
Thanks,

  • Vinay

Unable to scale cluster with VNET peerings

I have a Windows Kubernetes ACS cluster. I am using VNET peering in order to allow the cluster to reach out to our internal resources. When I try to scale the cluster with the VNET peerings in place, I get the error in the screenshot below.

I removed the VNET peerings and the cluster was able to expand the number of agents. Then I added the peering back.

[screenshot: scale error message]

Nginx-based Ingress Controller vs Azure Application Gateway

I have a customer who is migrating their containers from AWS to Azure and is looking to use Kubernetes on ACS. They need SSL offloading and are wondering whether to use the Nginx-based Ingress Controller or the Azure Application Gateway.

According to the Status of Kubernetes on ACS (https://github.com/Azure/ACS/blob/master/kubernetes-status.md#future-work) the Azure Ingress Controller will leverage the service-based Application Gateway, so they are wondering whether this is a better long-term strategy.

Recommended upgrade method

We have several ACS clusters running right now but I'd like to get them upgraded to 1.5.x soon.

Is there a documented/recommended script to handle this? I've looked around and only found bits and pieces. Thanks!

(Also, to have a real bug here as well, the URLs on the README for the MS Code of Conduct are broken)

Kubernetes: Deleting cluster node should delete non working node in priority

Is this a BUG REPORT or FEATURE REQUEST?
FEATURE REQUEST

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes - 1.6.6

What happened:

  1. I set up a new ACS cluster (Kubernetes) with two nodes.
  2. I drained the first agent node.
  3. I updated the ACS cluster to only have one node.
  4. It removed the most recently created node, which wasn't the drained node.
  5. Only pods running on the master were working.

What you expected to happen:
I expect the cluster to keep working nodes in priority, i.e. to remove the drained node first.

How to reproduce it (as minimally and precisely as possible):

  1. Create a new cluster with two nodes
  2. Drain the first node 'XXXXX-0'
  3. Via UI, set node to 1

Anything else we need to know:

Unable to change agents count in ACS

I have deployed a Windows + k8s cluster on Azure Container Service and then tried to change the agent pool from 1 to 2 agents, and got the error below.
[screenshot: scale error message]

I have also tried scaling using Azure CLI 2.0 and am getting this error:
Parameter 'ContainerServiceWindowsProfile.admin_password' can not be None.

I have used this cmd:
az acs scale -g acs-test-d3 -n containerservice-acs-test-d3 --new-agent-count 3

SQL Server Linux Container running as StatefulSet shows status as Recovery Pending

Is this a request for help?:

YES

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

What happened:
I created a custom SQL Server Linux container by following the process described below.

  1. Create a standard SQL Sever Container by running docker run -d -p 1433:1433 -e "SA_PASSWORD=<password>" -e "ACCEPT_EULA=Y" microsoft/mssql-server-linux
  2. Connect to this container on 127.0.0.1, 1433 from SQL Server Management Studio and create a custom database by running the following script:
GO
IF(db_id(N'snoopyshoppingcart') IS NULL)
BEGIN
	CREATE DATABASE snoopyshoppingcart 
		ON
			(
				NAME = ssc_dat,
				FILENAME = N'/ssc/ssc.mdf'
			)
		LOG ON  
			( 
				NAME = ssc_log,  
				FILENAME = N'/ssc/ssc.ldf'
			)
END;

GO
USE snoopyshoppingcart

GO
IF NOT EXISTS (SELECT * FROM sysobjects WHERE NAME='shopping' AND XTYPE='U')
BEGIN
	CREATE TABLE shopping
	(
		AddedOn datetime,
		ConnectionID nvarchar(100),
		IP nvarchar(20),
		CartItem nvarchar(100)
	)
END

Note that the database files are created on the path /ssc. This is a deviation from the standard SQL Server files path /var/opt/mssql/data. I want to use /ssc, a custom path, as a mountpoint and mount it on an Azure Disk when creating a StatefulSet for this container.
3. Run SHUTDOWN WITH NOWAIT.
4. Stop SQL Server Container.
5. Commit changes by running docker commit 8f maksh/snoopyshoppingcartdb and creating a new image.
6. Push this custom image by running docker push maksh/snoopyshoppingcartdb

When I create a container from this custom image and run locally (127.0.0.1, 1433), I can see that my custom database is available.

However, when I run this container as a StatefulSet on Kubernetes in Azure, the database shows status Recovery Pending.
My k8s manifests are below.

Secret (For sa password):

apiVersion: v1
kind: Secret
metadata:
  name: sqlsecret
type: Opaque
data:
  sapassword: UGFzc3dvcmQxMjM0

Storage Class:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurestorageclass
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Standard_LRS
  location: southeastasia
  storageAccount: <my-storage-account-in-same-rg-as-k8s>

Service:

apiVersion: v1
kind: Service
metadata:
  name: sqlservice
  labels:
    app: sqlservice
spec:
  type: LoadBalancer
  ports:
  - port: 1433
    targetPort: 1433
  selector:
    app: sqlinux

StatefulSet:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: sqlserverstatefulset
spec:
  serviceName: "sqlservice"
  replicas: 1
  template:
    metadata:
      labels:
        app: sqlinux
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: sqlinux
          image: maksh/snoopyshoppingcartdb
          env:
            - name: SA_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: sqlsecret
                  key: sapassword
            - name: ACCEPT_EULA
              value: "Y"
          ports:
            - containerPort: 1433
          volumeMounts:
            - name: sql-persistent-storage
              mountPath: "/ssc"
  volumeClaimTemplates:
  - metadata:
      name: sql-persistent-storage
      annotations:
        volume.beta.kubernetes.io/storage-class: "azurestorageclass"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 16Gi

The pod logs the following message (it appears only on Kubernetes; the local container doesn't log it):

2017-08-14 07:20:48.76 spid24s     Starting up database 'snoopyshoppingcart'.
 
2017-08-14 07:20:48.77 spid24s     Error: 17204, Severity: 16, State: 1.
 
2017-08-14 07:20:48.77 spid24s     FCB::Open failed: Could not open file /ssc/ssc.mdf for file number 1.  OS error: 2(The system cannot find the file specified.).
 
2017-08-14 07:20:48.78 spid9s      The resource database build version is 14.00.800. This is an informational message only. No user action is required.
 
2017-08-14 07:20:48.79 spid24s     Error: 5120, Severity: 16, State: 101.
 
2017-08-14 07:20:48.79 spid24s     Unable to open the physical file "/ssc/ssc.mdf". Operating system error 2: "2(The system cannot find the file specified.)".
 
2017-08-14 07:20:48.81 spid24s     Error: 17207, Severity: 16, State: 1.
 
2017-08-14 07:20:48.81 spid24s     FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file '/ssc/ssc.ldf'. Diagnose and correct the operating system error, and retry the operation.
 
2017-08-14 07:20:48.81 spid9s      Starting up database 'model'.
 
2017-08-14 07:20:48.83 spid24s     File activation failure. The physical file name "/ssc/ssc.ldf" may be incorrect.

When I connect to SQL Server using the external IP of the service, I see my custom database shown with status Recovery Pending.

What you expected to happen:
I expect the custom SQL Server database to be mounted on the Azure Disk and to be in an operational state.

How to reproduce it (as minimally and precisely as possible):
Follow steps as mentioned above.

Anything else we need to know:
A VHD disk gets created in the storage account. This matches the PersistentVolumeClaim.

Windows kubernetes clusters not working

The image we were using was susceptible to WannaCry at deploy time. It would eventually run Windows Update and patch itself, so existing clusters should be fine, but the Windows team pulled the image down. We are working on a new build with a more recent version of Windows and should start rolling it out by EOD; by 6/2 it should be globally available. We obviously need to be better about avoiding things like this in the future before we GA this feature.
Existing clusters will no longer be able to scale up and down. To enable that feature, a new cluster will need to be provisioned and brought up.

VMSS/auto-scaling support for Kubernetes on ACS

Hi,
At the moment, it appears that VMSS is not supported for Kubernetes nodes on ACS, only Availability Sets. This means that auto-scaling is not an option, and any node scaling needs to be done manually.
I took a look at ACS-Engine, but while that supports VMSS, it also does not support it for Kubernetes. So, I'm guessing that this is a limitation of the core ACS platform, is this correct?
In which case, it would be great to have VMSS support - I am dealing with customers who are moving their containers from AWS to Azure, but the lack of auto-scaling is a blocker.
Thanks,
James
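
Until VMSS support lands, the only option is to scale the agent pool manually, for example via the CLI. A sketch (resource group, cluster name, and count are placeholders):

az acs scale --resource-group <resource-group> --name <cluster-name> --new-agent-count 5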

Kubernetes Clusters Broken after Deployment or Unattended Upgrades

Problem

docker 1.13.0 was released and appears to break outbound container-to-internet traffic. New clusters get the latest docker automatically. Existing clusters install updates automatically and will be affected whenever their nodes reboot for any reason.

Mitigation in ACS

The Azure Container Service is rolling out a fix that will pin docker-engine to 1.12.* for the time being.

This change is already live in the open-source core of ACS, ACS-Engine. You can see that change here: Azure/acs-engine#195

Workaround for Existing Clusters

These steps should allow you to workaround this issue. Please note, if you blindly execute the following, it will reboot all of your worker nodes one-by-one with no delay. Please decompose these instructions and drain/cordon/upgrade your nodes if you can't handle all nodes being rebooted at once.

These steps will downgrade docker on all VMs and create an apt preference policy effectively pinning docker-engine to 1.12.*.

# copy your SSH key to the master (substitute your admin user and master FQDN)
scp ~/.ssh/id_rsa azureuser@<master-fqdn>:

# ssh to master
ssh azureuser@<master-fqdn>

# workaround on all nodes (and reboot on all)
kubectl get nodes -o jsonpath={.items[*].metadata.name} \
    | tr ' ' '\n' \
    | grep -v master \
    | xargs -d '\n' -I "{}" \
        ssh -oStrictHostKeyChecking=no -i id_rsa {} bash -c "sudo apt-mark unhold docker-engine || true; \
            printf 'Package: docker-engine\nPin: version 1.12.*\nPin-Priority: 550\n' \
                | sudo tee /etc/apt/preferences.d/docker.pref; \
            sudo apt install -y --allow-downgrades docker-engine=1.12.6-0~ubuntu-xenial; \
            sudo nohup reboot &"

# workaround on master
sudo apt-mark unhold docker-engine || true; \
    printf 'Package: docker-engine\nPin: version 1.12.*\nPin-Priority: 550\n' \
        | sudo tee /etc/apt/preferences.d/docker.pref; \
    sudo apt install -y --allow-downgrades docker-engine=1.12.6-0~ubuntu-xenial; \
    sudo reboot
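
After the nodes come back, it may be worth verifying that the pin took effect on each one. A quick sketch:

# candidate version should be 1.12.* with pin priority 550
apt-cache policy docker-engine

# the running daemon should report a 1.12.x server version
docker version --format '{{.Server.Version}}'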

az acs list does not return resource group w/ new resource provider

Using the new resource provider to deploy a Kubernetes cluster, the az acs list does not return the name of the resource group as with the old resource provider:

az acs list -o table

Result:

Location    Name       ProvisioningState    ResourceGroup
----------  ---------  -------------------  ---------------
westeurope  jcok8s     Succeeded            JCOK8S-RG
uksouth     jcoacsuk1  Succeeded

Same with JSON output:

az acs list -o json

Result:

[
  {
    "agentPoolProfiles": [
      {
        "count": 1,
        "dnsPrefix": "MASKED-agents",
        "fqdn": "",
        "name": "agentpools",
        "vmSize": "Standard_D2_v2"
      }
    ],
    "customProfile": null,
    "diagnosticsProfile": {
      "vmDiagnostics": {
        "enabled": false,
        "storageUri": null
      }
    },
    "id": "MASKED",
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "MASKED"
          }
        ]
      }
    },
    "location": "westeurope",
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "MASKED",
      "fqdn": "MASKED.westeurope.cloudapp.azure.com"
    },
    "name": "jcok8s",
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "provisioningState": "Succeeded",
    "resourceGroup": "JCOK8S-RG",
    "servicePrincipalProfile": {
      "clientId": "MASKED",
      "secret": null
    },
    "tags": null,
    "type": "Microsoft.ContainerService/ContainerServices",
    "windowsProfile": null
  },
  {
    "agentPoolProfiles": [
      {
        "count": 2,
        "dnsPrefix": "",
        "fqdn": "",
        "name": "linuxpool",
        "vmSize": "Standard_D2_v2"
      },
      {
        "count": 2,
        "dnsPrefix": "",
        "fqdn": "",
        "name": "windowspool",
        "vmSize": "Standard_D2_v2"
      }
    ],
    "customProfile": null,
    "diagnosticsProfile": null,
    "id": "MASKED",
    "linuxProfile": {
      "adminUsername": "jcorioland",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "MASKED"
          }
        ]
      }
    },
    "location": "uksouth",
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "MASKED",
      "fqdn": "MASKED.uksouth.cloudapp.azure.com"
    },
    "name": "jcoacsuk1",
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "provisioningState": "Succeeded",
    "servicePrincipalProfile": {
      "clientId": "MASKED",
      "secret": null
    },
    "tags": null,
    "type": "Microsoft.ContainerService/ContainerServices",
    "windowsProfile": {
      "adminPassword": null,
      "adminUsername": "jcorioland"
    }
  }
]
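
As a workaround until this is fixed, the resource group can be recovered from the resource id, since an ARM resource id always contains a resourceGroups/<name> segment. A sketch using jq:

az acs list -o json \
  | jq -r '.[] | [.name, (.id | capture("resourceGroups/(?<rg>[^/]+)").rg)] | @tsv'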

ACS cluster is not ready to use after deployment template reports success

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG report.

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

Kubernetes.

What happened:

The ACS cluster is not ready to use immediately after it is created: for a couple of minutes it is not possible to open an SSH connection or run kubectl commands.

What you expected to happen:

ACS cluster should be ready to use when the creation template returns success. The creation should return success only after the cluster has Kubernetes binaries and other dependencies installed.

How to reproduce it (as minimally and precisely as possible):

  1. Create a new ACS cluster (I used the Azure SDK, ComputeClient.ContainerServices.CreateOrUpdateAsync)
  2. As soon as the cluster is created, try to create an SSH connection to the master. Connection will be aborted by the server.
  3. The SSH connection will eventually succeed, run any kubectl command on the cluster. An error: kubectl: command not found is returned.
  4. Kubectl commands will be available to use after some time.

Anything else we need to know:

On examining the /var/log/auth.log file on the cluster for SSH connection failures, this is the logged statement: fatal: Access denied for user azureuser by PAM account configuration [preauth].
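
Until the provisioning status reflects actual readiness, a client-side retry loop is the usual workaround. A sketch, assuming the default azureuser admin account:

# poll until the master accepts SSH and kubectl is on the PATH
until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no azureuser@<master-fqdn> \
    'command -v kubectl && kubectl get nodes' >/dev/null 2>&1; do
  echo 'cluster not ready yet, retrying in 15s...'
  sleep 15
done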

ACS kubernetes hyperkube flaky

There are lots of hyperkube failures while the ACS cluster is running.

[screenshot of failing hyperkube pods omitted]

I retrieved the logs from one of the kube-controller-manager containers:

E0509 22:02:57.817269       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured
E0509 22:53:05.605381       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured
E0511 01:39:12.416926       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured
E0512 07:41:17.543209       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured

Is there any solution for this? It looks like etcd may be the culprit behind all of these events.
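
If etcd is suspect, checking its health from a master node is a reasonable first step. A sketch, assuming the etcd v2 CLI is available on the master:

# membership and health of the etcd cluster
sudo etcdctl cluster-health

# recent etcd service state and logs
sudo systemctl status etcd --no-pager
sudo journalctl -u etcd --since '1 hour ago' | tail -n 50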

Kubernetes secrets in Azure Key Vault

Hi all,

Is it possible to create/store a Kubernetes secret in Azure Key Vault, so that when you do a container deployment, the Kubernetes Master is able to query the Key Vault service for the secret value and use it in the deployment?

Thanks,
James
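
As far as I know there is no built-in Key Vault integration today; a common workaround is to read the value out of Key Vault at deploy time and create an ordinary Kubernetes secret from it. A sketch, with my-vault and db-password as illustrative names:

PASSWORD=$(az keyvault secret show --vault-name my-vault --name db-password --query value -o tsv)
kubectl create secret generic db-password --from-literal=password="$PASSWORD"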

Deleting Container Service leaves provisioned resources around

Migrating this from Azure/acs-engine#792


Ciao!

When provisioning an Azure Container Service instance through the API - lots of resources are created for things like Storage Accounts, VMSS etc, through what appears to be an ARM Template. When deleting the Container Service instance through the API these resources aren't cleared up - which we've had come through as issue hashicorp/terraform-provider-azurerm#79

Would it be possible to fix this? Thanks! :)

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
I tried DC/OS, using the top example on this page, but I'm assuming this is applicable to all.

What happened:
The Container Service got deleted, but all the provisioned resources got left behind.

What you expected to happen:
The resources provisioned by Azure Container Service would be cleaned up when deleting the Container Service instance.

How to reproduce it (as minimally and precisely as possible):
Delete the Container Service instance through the Azure API (we do this via terraform destroy --target=azurerm_container_service.test, but just a regular API call will work)

Anything else we need to know:

This bug was originally hashicorp/terraform#14128 before migrating to the new location: hashicorp/terraform-provider-azurerm#79
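
Until deletion is cascaded properly, the usual mitigation is to give each cluster its own resource group and delete the group instead of the individual Container Service resource. A sketch:

az group create --name my-acs-rg --location westeurope
az acs create --resource-group my-acs-rg --name my-acs \
    --orchestrator-type Kubernetes --generate-ssh-keys

# deleting the group removes the cluster plus the storage accounts, NICs, etc.
az group delete --name my-acs-rg --yes --no-wait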

Enable RBAC in kubernetes orchestration

Is this a request for help?:
HELP

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
FEATURE REQUEST

I have deployed a Kubernetes cluster (Linux) using Azure Container Service. I was trying to deploy prometheus-operator but found that RBAC is not enabled.

Is there any way to enable RBAC in ACS?

My question is: when will RBAC be supported in ACS? I saw that the acs-engine repository added RBAC support in July, but it has not made it into ACS yet. How long does it take for new upgrades in acs-engine to appear in ACS? And is there any timeline for a Kubernetes 1.7.x upgrade?

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes : 1.6.6

What happened:
Unable to create clusterrolebindings and unable to deploy prometheus-operator.

What you expected to happen:
Clusterrolebindings should be created.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:
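For reference, this is the kind of binding prometheus-operator needs, and exactly what fails on a cluster created without RBAC authorization (names are illustrative):

kubectl create clusterrolebinding prometheus-operator \
    --clusterrole=cluster-admin \
    --serviceaccount=monitoring:prometheus-operator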

Heapster pod won't start on ACS cluster provisioned with Kubernetes version 1.6.2

Details

Heapster pod won't start on ACS cluster provisioned with Kubernetes version 1.6.2

$ kubectl get pods -n kube-system
heapster-v1.2.0-559699904-ksjfs                 1/2       rpc error: code = 2 desc = failed to start container "2bb82c7425c1456ea78a4c743919173030ed9583cbfeaf66f76206237b6d036c": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"/heapster\\\\\\\": stat /heapster: no such file or directory\\\"\\n\""}   3          32s

Fix

Single Master

Prerequisites

  • Master FQDN - Found under the cluster in the Azure Portal
  • SSH key used to provision the cluster

Steps

  • Update Heapster addon deployment on master node
  ssh azureuser@<master-fqdn> sudo sed -i 's/exechealthz-amd64:1.2/heapster:v1.2.0/' /etc/kubernetes/addons/kube-heapster-deployment.yaml
  • Using kubectl (either locally or on the master) delete the current Heapster deployment
  kubectl delete deploy heapster-v1.2.0 -n kube-system
  • Watch pods in kube-system namespace and wait for addon-manager to recreate the Heapster pods
$ kubectl get pods -n kube-system -w
heapster-v1.2.0-1516787090-cp68p   0/2       Pending   0         0s
heapster-v1.2.0-1516787090-cp68p   0/2       Pending   0         0s
heapster-v1.2.0-1516787090-cp68p   0/2       ContainerCreating   0         1s
heapster-v1.2.0-1516787090-cp68p   2/2       Running   0         3s

Multi-master

Prerequisites

  • Master FQDN - Found under the cluster in the Azure Portal
  • SSH key used to provision the cluster

Steps

  • Load the provisioned ssh-key into the local ssh-agent (this procedure will differ based on the ssh-client)
ssh-add <key-path>
ssh-add -l
  • Update the Heapster deployment manifest on all master nodes
for master in `kubectl get nodes -l role=master -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'`; \
    do echo $master; \
    ssh -A azureuser@<master-fqdn> ssh $master -oStrictHostKeyChecking=no sudo sed -i 's/exechealthz-amd64:1.2/heapster:v1.2.0/' /etc/kubernetes/addons/kube-heapster-deployment.yaml; \
    done
  • Using kubectl (either locally or on the master) delete the current Heapster deployment
  kubectl delete deploy heapster-v1.2.0 -n kube-system
  • Watch pods in kube-system namespace and wait for addon-manager to recreate the Heapster pods
$ kubectl get pods -n kube-system -w
heapster-v1.2.0-1516787090-cp68p   0/2       Pending   0         0s
heapster-v1.2.0-1516787090-cp68p   0/2       Pending   0         0s
heapster-v1.2.0-1516787090-cp68p   0/2       ContainerCreating   0         1s
heapster-v1.2.0-1516787090-cp68p   2/2       Running   0         3s
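
To confirm the fix took, check that the recreated deployment now references the heapster image rather than exechealthz, and that metrics are being served. A quick sketch:

kubectl get deploy heapster-v1.2.0 -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[*].image}'

# basic smoke test; requires heapster to be up
kubectl top node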
