
openshift-azure


Prerequisites

Note that this README is targeted at AOS-Azure contributors. If you are not a member of this team, these instructions may not work, as they assume permissions you may not have.

  1. Utilities. Install the following (a quick sanity-check sketch for steps 1-4 follows this list):

    1. Golang 1.11.6 (can also use package manager)
    2. Latest Azure CLI
    3. OpenShift Origin 3.11 client tools (can also use package manager)
    4. Latest Glide. Note: Glide 0.13.1 is known to be broken.
    5. jq (can also use package manager)

    Development helper scripts assume an up-to-date GNU tools environment. Recent Linux distros should work out-of-the-box.

    macOS ships with outdated BSD-based tools. We recommend installing macOS GNU tools.

  2. Environment variables. Ensure that $GOPATH/bin is in your path:

    export PATH=$PATH:${GOPATH:-$HOME/go}/bin

  3. Azure CLI access. Log in to Azure with the CLI by running az login and entering your credentials.

  4. OpenShift CI cluster access. Log in to the CI cluster using oc login and a token from the CI cluster web interface. You can copy the required command by clicking on your username and the "Copy Login Command" option in the web portal.

  5. Codebase. Check out the codebase:

    go get github.com/openshift/openshift-azure/...

  6. Secrets. Retrieve cluster creation secrets from the vault:

    export VAULT_ADDR=https://vault.ci.openshift.org
    ./vault login $TOKEN_FROM_THE_VAULT
    ./vault kv get -format=json "kv/selfservice/azure/cluster-secrets-azure/" | jq ".data.data"  > vault-secrets.json
    python3 vault-secrets.py
    
  7. Environment file. Create an environment file:

    cp env.example env

  8. AAD Application / Service principal. Create a personal AAD Application:

    1. hack/aad.sh app-create user-$USER-aad aro-team-shared
    2. Update env to include the AZURE_AAD_CLIENT_ID and AZURE_AAD_CLIENT_SECRET values output by aad.sh.
    3. Ask an AAD administrator to grant permissions to your application.
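
If you want to sanity-check the prerequisites before continuing, something like the following should work (a rough sketch; exact version output will vary):

go version        # expect go1.11.x
az --version
oc version
glide --version   # anything except 0.13.1
jq --version
az account show   # confirms "az login" succeeded
oc whoami         # confirms "oc login" against the CI cluster succeeded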

Deploy an OpenShift cluster

  1. Source the env file: . ./env

  2. Determine an appropriate resource group name for your cluster (e.g. for a test cluster, you could call it $USER-test). Then export RESOURCEGROUP and run ./hack/create.sh $RESOURCEGROUP to deploy a cluster. A consolidated example follows this list.

  3. Access the web console via the link printed by create.sh, logging in with your Azure credentials.

  4. To inspect pods running on the OpenShift cluster, run KUBECONFIG=_data/_out/admin.kubeconfig oc get pods.

  5. To ssh into any OpenShift master node, run ./hack/ssh.sh. You can ssh directly to any other host from the master; sudo -i will give you a root shell.

  6. Run ./hack/delete.sh to delete the deployed cluster.
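
Putting the steps above together, a typical test-cluster session looks roughly like this (a sketch; the resource group name is only an example):

. ./env
export RESOURCEGROUP=$USER-test
./hack/create.sh $RESOURCEGROUP

# inspect the running cluster
KUBECONFIG=_data/_out/admin.kubeconfig oc get pods
./hack/ssh.sh

# tear everything down when finished
./hack/delete.sh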

Examples

Basic OpenShift configuration (also see test/manifests/fakerp/create.yaml):

name: openshift
location: $AZURE_REGION
properties:
  openShiftVersion: v3.11
  authProfile:
    identityProviders:
    - name: Azure AD
      provider:
        kind: AADIdentityProvider
        clientId: $AZURE_AAD_CLIENT_ID
        secret: $AZURE_AAD_CLIENT_SECRET
        tenantId: $AZURE_TENANT_ID
  networkProfile:
    vnetCidr: 10.0.0.0/8
  masterPoolProfile:
    count: 3
    vmSize: Standard_D2s_v3
    subnetCidr: 10.0.0.0/24
  agentPoolProfiles:
  - name: infra
    role: infra
    count: 3
    vmSize: Standard_D2s_v3
    subnetCidr: 10.0.0.0/24
    osType: Linux
  - name: compute
    role: compute
    count: 1
    vmSize: Standard_D2s_v3
    subnetCidr: 10.0.0.0/24
    osType: Linux

CI infrastructure

Read more about how to work with our CI system here.

For any infrastructure-related issues, contact the Developer Productivity team, which is responsible for managing the OpenShift CI infrastructure, in the #forum-testplatform Slack channel.

Contributors

0xmichalis, asalkeld, charlesakalugwu, ehashman, gburges, gvanderpotte, hawkowl, jaormx, jim-minter, m1kola, makdaam, mgahagan73, mjudeikis, nilsanderselde, olga-mir, olga-mirensky, openshift-azure-robot, openshift-merge-robot, pweil-, sjkingo, sozercan, thekad, troy0820, yhcote


Issues

router fails to deploy

I0626 13:03:24.673908       1 template.go:244] Starting template router (v3.10.0-rc.0+e4d22b0-17)
I0626 13:03:24.675639       1 metrics.go:159] Router health and metrics port listening at 0.0.0.0:1936
E0626 13:03:24.682866       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0626 13:03:24.705162       1 router.go:252] Router is including routes in all namespaces
E0626 13:03:24.918625       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
E0626 13:03:24.941509       1 limiter.go:137] error reloading router: exit status 1
[ALERT] 176/130324 (30) : Starting frontend public_ssl: cannot bind socket [0.0.0.0:443]

router stats scrape fails

But the router comes up ok.

I0628 14:47:19.452212       1 template.go:244] Starting template router (v3.10.0-rc.0+9453255-21)
I0628 14:47:19.455144       1 metrics.go:159] Router health and metrics port listening at 0.0.0.0:1936
E0628 14:47:19.462591       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0628 14:47:19.530212       1 router.go:454] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0628 14:47:19.530318       1 router.go:252] Router is including routes in all namespaces
I0628 14:47:19.793799       1 router.go:454] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

Error while running sync: cannot sync cluster config

During a cluster deploy I saw the following error:

2018/07/27 13:54:10 Error while syncing: cannot sync cluster config: serviceaccounts "default-rolebindings-controller" already exists
+ go run cmd/sync/sync.go -run-once=true
2018/07/27 13:50:31 Sync process started!
2018/07/27 13:54:06 Update Namespace/default
2018/07/27 13:54:06 - Object.map[spec].map[finalizers].slice[1]: <no value> != openshift.io/origin
2018/07/27 13:54:06 - Object.map[metadata].map[annotations]: <does not have key> != map[openshift.io/node-selector:]
2018/07/27 13:54:06 Skip Namespace/kube-public
2018/07/27 13:54:06 Create Namespace/kube-service-catalog
2018/07/27 13:54:06 Skip Namespace/kube-system
2018/07/27 13:54:06 Update Namespace/openshift
2018/07/27 13:54:06 - Object.map[spec].map[finalizers].slice[1]: <no value> != openshift.io/origin
2018/07/27 13:54:06 Create Namespace/openshift-ansible-service-broker
2018/07/27 13:54:06 Create Namespace/openshift-azure
2018/07/27 13:54:06 Skip Namespace/openshift-infra
2018/07/27 13:54:06 Create Namespace/openshift-metrics
2018/07/27 13:54:06 Update Namespace/openshift-node
2018/07/27 13:54:06 - Object.map[metadata].map[annotations]: <does not have key> != map[openshift.io/node-selector:]
2018/07/27 13:54:07 Create Namespace/openshift-sdn
2018/07/27 13:54:07 Create Namespace/openshift-template-service-broker
2018/07/27 13:54:07 Create Namespace/openshift-web-console
2018/07/27 13:54:07 Create ServiceAccount/default/default
2018/07/27 13:54:07 Create ServiceAccount/default/registry
2018/07/27 13:54:07 Create ServiceAccount/default/router
2018/07/27 13:54:07 Create ServiceAccount/kube-service-catalog/service-catalog-apiserver
2018/07/27 13:54:07 Create ServiceAccount/kube-service-catalog/service-catalog-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/attachdetach-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/certificate-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/clusterrole-aggregation-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/cronjob-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/daemon-set-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/deployment-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/disruption-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/endpoint-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/generic-garbage-collector
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/job-controller
2018/07/27 13:54:09 Create ServiceAccount/kube-system/namespace-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/node-controller
2018/07/27 13:54:09 Create ServiceAccount/kube-system/persistent-volume-binder
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/pod-garbage-collector
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/pv-protection-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/pvc-protection-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/replicaset-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/replication-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/resourcequota-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/service-account-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/service-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/statefulset-controller
2018/07/27 13:54:09 Create ServiceAccount/openshift-ansible-service-broker/asb
2018/07/27 13:54:09 Create ServiceAccount/openshift-ansible-service-broker/asb-client
2018/07/27 13:54:09 Create ServiceAccount/openshift-infra/bootstrap-autoapprover
2018/07/27 13:54:09 Skip ServiceAccount/openshift-infra/build-config-change-controller
2018/07/27 13:54:10 Skip ServiceAccount/openshift-infra/build-controller
2018/07/27 13:54:10 Skip ServiceAccount/openshift-infra/cluster-quota-reconciliation-controller
2018/07/27 13:54:10 Create ServiceAccount/openshift-infra/default-rolebindings-controller
2018/07/27 13:54:10 Error while syncing: cannot sync cluster config: serviceaccounts "default-rolebindings-controller" already exists
+ az group deployment wait -g kwoodsontest -n azuredeploy --created --interval 10
+ KUBECONFIG=_data/_out/admin.kubeconfig
+ go run cmd/healthcheck/healthcheck.go
panic: unexpected error code 403 from console

goroutine 1 [running]:
main.main()
	/home/kwoodson/go/src/github.com/openshift/openshift-azure/cmd/healthcheck/healthcheck.go:53 +0x44
exit status 2

clean prometheus config

Prometheus currently has a default config that scrapes all components, including the control plane. We need to strip it down.
screenshot from 2018-07-17 12-08-09

Directory permission is needed for the current user to register the application

Directory permission is needed for the current user to register the application. For how to configure, please refer 'https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal'. Original error: Insufficient privileges to complete the operation.
+ az ad sp create --id
az ad sp create: error: argument --id: expected one argument

@jim-minter

oc logs does not work

$ oc logs -f dc/router
Error from server: Get https://vm-infra-0:10250/containerLogs/default/router-1-deploy/deployment?follow=true: dial tcp: lookup vm-infra-0 on 172.17.0.1:53: no such host

kubernetes service deployed in the ccp is broken

error: couldn't get deployment docker-registry-1: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/docker-registry-1: dial tcp 172.30.0.1:443: i/o timeout

Likely needs some iptables magic

sync process does not handle errors gracefully

The sync process panics if DNS is not resolvable:

+ go run cmd/sync/sync.go
panic: Get https://openshift.mjudeikis-etcd.osadev.cloud/healthz: dial tcp: lookup openshift.mjudeikis-etcd.osadev.cloud on 127.0.0.1:53: no such host

goroutine 1 [running]:
main.main()
        /home/mjudeiki/go/src/github.com/jim-minter/azure-helm/cmd/sync/sync.go:42 +0x1fb
exit status 2

cc: @jim-minter , this is what I was telling you on Friday :)

refactor validate

We need to rebase the external API and refactor validate() to return []error, to make validation more testable. A sketch of the proposed shape follows.
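
For illustration only, a minimal sketch of the proposed shape; the OpenShiftCluster type and its fields below are placeholders, not the actual external API:

package api

import "fmt"

// OpenShiftCluster is a placeholder standing in for the external API type.
type OpenShiftCluster struct {
    Name        string
    Location    string
    MasterCount int
}

// validate collects every problem instead of stopping at the first one,
// which makes table-driven unit tests straightforward.
func validate(cs *OpenShiftCluster) (errs []error) {
    if cs.Name == "" {
        errs = append(errs, fmt.Errorf("name must not be empty"))
    }
    if cs.Location == "" {
        errs = append(errs, fmt.Errorf("location must not be empty"))
    }
    if cs.MasterCount != 3 {
        errs = append(errs, fmt.Errorf("invalid masterCount %d: must be 3", cs.MasterCount))
    }
    return errs
}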

service catalog does not have enough access to run

I0625 13:48:41.747136       1 feature_gate.go:190] feature gates: map[OriginatingIdentity:true]
I0625 13:48:41.747270       1 hyperkube.go:192] Service Catalog version v3.10.0-rc.0+8d6748f-dirty (built 2018-06-23T19:27:11Z)
W0625 13:48:42.314600       1 authentication.go:232] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:hcp:default" cannot get configmaps in the namespace "kube-system": User "system:serviceaccount:hcp:default" cannot get configmaps in project "kube-system"

The maximum number of data disks allowed to be attached to a VM of this size is 4.

29m         29m          1         master-etcd-865689d86b-zx8hk.15427899f21d6546              Pod                                                    Warning   FailedAttachVolume      attachdetach-controller             AttachVolume.Attach failed for volume "pvc-bbcfe918-8a8b-11e8-9437-0a58ac1f0b47" : Attach volume "kubernetes-dynamic-pvc-bbcfe918-8a8b-11e8-9437-0a58ac1f0b47" to instance "/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/MC_aks_aks_eastus/providers/Microsoft.Compute/virtualMachines/aks-agentpool-35064155-0" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="InvalidParameter" Message="A disk at LUN 3 already exists."
19m         28m          5         master-etcd-865689d86b-zx8hk.154278a627a75eff              Pod                                                    Warning   FailedMount             kubelet, aks-agentpool-35064155-0   Unable to mount volumes for pod "master-etcd-865689d86b-zx8hk_kargakis-test(bc637f14-8a8b-11e8-9437-0a58ac1f0b47)": timeout expired waiting for volumes to attach or mount for pod "kargakis-test"/"master-etcd-865689d86b-zx8hk". list of unmounted volumes=[master-etcd]. list of unattached volumes=[master-config master-etcd default-token-b9jln]

etcd cannot get deployed; cluster is broken

We probably want to revert #65

@jim-minter @mjudeikis @kwoodson

Move the service catalog controller onto the HCP

As discussed in scrum today

We will also want to clean up both daemon sets from the ccp

$ oc get ds
NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                         AGE
apiserver            0         0         0         0            0           node-role.kubernetes.io/master=true   2h
controller-manager   0         0         0         0            0           node-role.kubernetes.io/master=true   2h

osaapi branch vendoring broken

The current Glide config does not resolve dependencies correctly.
Branch: externalapi

pkg/addons/addons.go:162:106: cannot use "github.com/jim-minter/azure-helm/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions literal (type "github.com/jim-minter/azure-helm/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions) as type "github.com/jim-minter/azure-helm/vendor/k8s.io/kube-aggregator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions in argument to ac.ApiregistrationV1().APIServices().Get
pkg/addons/addons.go:286:36: cannot use restconfig (type *"github.com/jim-minter/azure-helm/vendor/k8s.io/client-go/rest".Config) as type *"github.com/jim-minter/azure-helm/vendor/k8s.io/kube-aggregator/vendor/k8s.io/client-go/rest".Config in argument to clientset.NewForConfig

healthcheck panic

Saw this in #114 but it seems applicable to master too.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd1a585]

goroutine 1 [running]:
github.com/openshift/openshift-azure/pkg/checks.WaitForHTTPStatusOk.func1(0xc42009e780, 0xc420535bef, 0x809801)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/checks/checks.go:48 +0x145
github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait.WaitFor(0xc4203e52c0, 0xc4204e8da0, 0x0, 0x1, 0xc4204e8da0)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:311 +0x6b
github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait.PollUntil(0x3b9aca00, 0xc4204e8da0, 0x0, 0x39, 0x0)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:290 +0x56
github.com/openshift/openshift-azure/pkg/checks.WaitForHTTPStatusOk(0x1194c20, 0xc42003e130, 0x117e9e0, 0xc420234690, 0xc420556c40, 0x39, 0x39, 0x40)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/checks/checks.go:31 +0x147
github.com/openshift/openshift-azure/pkg/addons.(*client).WaitForHealthz(0xc4201cf080, 0xc4201cf080, 0x0)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/addons/client.go:93 +0xf2
github.com/openshift/openshift-azure/pkg/addons.newClient(0x0, 0x16, 0xc4204b7dc0, 0x52f945, 0xee4e40)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/addons/client.go:76 +0x20a
github.com/openshift/openshift-azure/pkg/addons.Main(0xc42041d9e0, 0xc42003b680, 0x1df00, 0xee4e40, 0xc42000ee20)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/addons/addons.go:204 +0x42
main.sync(0xe9561b, 0x1091814)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/cmd/sync/sync.go:56 +0x24a
main.main()
	/home/kargakis/go/src/github.com/openshift/openshift-azure/cmd/sync/sync.go:68 +0x57
exit status 2

/kind bug

@kwoodson @jim-minter

invalid SC certificate

Seeing the following in the core apiserver log

logging error output: "Error: 'x509: certificate is valid for servicecatalog-api, not apiserver.kube-service-catalog.svc'\nTrying to reach: 'https://servicecatalog-api:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s'"

Unclear how safe it is to add the CCP service name in the SC cert but opening a PR anyway to test and discuss.

@jim-minter @pweil- @liggitt

nodes can't "speak" to each other

Prometheus is running on infra and trying to scrape the nodes (or vice versa).

[root@infra-000000 ~]# curl http://10.0.0.5:9100/metrics
curl: (7) Failed connect to 10.0.0.5:9100; No route to host
[root@infra-000000 ~]# ping 10.0.0.5
PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.747 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.781 ms

screenshot from 2018-07-17 14-07-53

Cannot sign in to the web console using AAD

Request Id: cf18d201-c73f-4f1d-92fa-25b06c002200
Correlation Id: 1f095342-9f41-4703-8be7-a1556c1b9b8c
Timestamp: 2018-07-03T09:10:15Z
Message: AADSTS50011: The reply url specified in the request does not match the reply urls configured for the application: 'CLIENT_IDXXXXXXXXXXXXX'. 

API server log:

I0703 09:10:14.902183       1 handler.go:66] Authentication needed for {Azure AD 0xac33be0 {XXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXX [openid] map[] https://login.microsoftonline.com/XXXXXXXXXXXXXXXXXXXXXXX/oauth2/authorize https://login.microsoftonline.com/XXXXXXXXXXXXXXXXXXXXXXX/oauth2/token  [sub] [unique_name] [email] [name] <nil>}}
I0703 09:10:14.902461       1 handler.go:78] redirect to https://login.microsoftonline.com/XXXXXXXXXXXXXXXXXXXXXXX/oauth2/authorize?client_id=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&redirect_uri=https%3A%2F%2Fopenshift.mkargaki-hcp.osadev.cloud%2Foauth2callback%2FAzure+AD&response_type=code&scope=openid&state=Y3NyZj1kODY3NDYxZC03ZTlmLTExZTgtYmU2NS0xYWViMWUzMjgyNjgmdGhlbj0lMkZvYXV0aCUyRmF1dGhvcml6ZSUzRmNsaWVudF9pZCUzRG9wZW5zaGlmdC13ZWItY29uc29sZSUyNmlkcCUzREF6dXJlJTJCQUQlMjZyZWRpcmVjdF91cmklM0RodHRwcyUyNTNBJTI1MkYlMjUyRm9wZW5zaGlmdC5ta2FyZ2FraS1oY3Aub3NhZGV2LmNsb3VkJTI1MkZjb25zb2xlJTI1MkZvYXV0aCUyNnJlc3BvbnNlX3R5cGUlM0Rjb2RlJTI2c3RhdGUlM0RleUowYUdWdUlqb2lMMk5oZEdGc2IyY2lMQ0p1YjI1alpTSTZJakUxTXpBMk1EZzFOakV4TURZdE16QXpOVFF4T1RreE1USTBOVGcyTXpBMU9EY3pNVFkzTXpNMk5qQXdNalkyTURrNU9EVXhNREl3TkRrMU5qY3lOak14TVRBNE1UVTJNelU0TVRRM09EQTNNRFkwTWpNNU9UQTBPVEkyTVRJaWZR
I0703 09:10:14.903042       1 wrap.go:42] GET /oauth/authorize?client_id=openshift-web-console&idp=Azure+AD&redirect_uri=https%3A%2F%2Fopenshift.mkargaki-hcp.osadev.cloud%2Fconsole%2Foauth&response_type=code&state=eyJ0aGVuIjoiL2NhdGFsb2ciLCJub25jZSI6IjE1MzA2MDg1NjExMDYtMzAzNTQxOTkxMTI0NTg2MzA1ODczMTY3MzM2NjAwMjY2MDk5ODUxMDIwNDk1NjcyNjMxMTA4MTU2MzU4MTQ3ODA3MDY0MjM5OTA0OTI2MTIifQ: (47.525557ms) 302 [[Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0] 10.244.0.9:51326]

Node role label missing from master nodes

$ oc get no master-000000 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-07-30T10:32:41Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: Standard_D2s_v3
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eastus
    failure-domain.beta.kubernetes.io/zone: "0"
    kubernetes.io/hostname: master-000000
  name: master-000000
<-- snip -->

whereas it's set correctly on infra and compute nodes

$ oc get no infra-000000 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-07-30T10:35:49Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: Standard_D2s_v3
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eastus
    failure-domain.beta.kubernetes.io/zone: "0"
    kubernetes.io/hostname: infra-000000
    node-role.kubernetes.io/infra: "true"
    region: infra
  name: infra-000000
<-- snip -->
$ oc get no compute-000000 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-07-30T10:35:49Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: Standard_D2s_v3
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eastus
    failure-domain.beta.kubernetes.io/zone: "0"
    kubernetes.io/hostname: compute-000000
    node-role.kubernetes.io/compute: "true"
    region: primary
  name: compute-000000
<-- snip -->

/kind bug

panic: couldn't find kind ClusterServiceBroker

2018/06/27 18:44:05 Create Secret/openshift-template-service-broker/templateservicebroker-client
panic: couldn't find kind ClusterServiceBroker

goroutine 1 [running]:
main.main()
	/home/kargakis/go/src/github.com/jim-minter/azure-helm/cmd/sync/sync.go:37 +0x1e8
exit status 2

This happened during the addons installation.

unable to build cakephp-mysql

After finishing a recent deploy, I attempted to build a cakephp-mysql application from the console. It failed with the following build errors:

The ImageStreamTag "php:7.0" is invalid: from: Error resolving ImageStreamTag php:7.0 in namespace test: imagestreams.image.openshift.io "php" not found

I then did an image import:

$ KUBECONFIG=_data/_out/admin.kubeconfig oc import-image --from=docker.io/library/php:7.0 php:7.0 --confirm
The import completed successfully.

Name:			php
Namespace:		test
Created:		Less than a second ago
Labels:			<none>
Annotations:		openshift.io/image.dockerRepositoryCheck=2018-07-17T18:48:55Z
Docker Pull Spec:	docker-registry.default.svc:5000/test/php
Image Lookup:		local=false
Unique Images:		1
Tags:			1

7.0
  tagged from docker.io/library/php:7.0
...

Then I attempted to start a build:

$ KUBECONFIG=_data/_out/admin.kubeconfig oc start-build cakephp-mysql-persistent          
build "cakephp-mysql-persistent-2" started
$ KUBECONFIG=_data/_out/admin.kubeconfig oc get builds           
NAME                         TYPE      FROM          STATUS                            STARTED              DURATION
cakephp-mysql-persistent-1   Source    Git@53d2216   Failed (PullBuilderImageFailed)   About a minute ago   40s
cakephp-mysql-persistent-2   Source    Git@53d2216   Failed (PullBuilderImageFailed)   32 seconds ago       5s
$ KUBECONFIG=_data/_out/admin.kubeconfig oc get pods
NAME                               READY     STATUS    RESTARTS   AGE
cakephp-mysql-persistent-1-build   0/1       Error     0          1m
cakephp-mysql-persistent-2-build   0/1       Error     0          46s

Fetched the logs:

$ KUBECONFIG=_data/_out/admin.kubeconfig oc logs cakephp-mysql-persistent-2-build
Error from server: no preferred addresses found; known addresses: [{Hostname compute-000000}]

Nodes are up and running:

NAME             STATUS    AGE       VERSION
compute-000000   Ready     4h        v1.10.0+b81c8f8
infra-000000     Ready     4h        v1.10.0+b81c8f8

Pods for this cluster are healthy:

NAME                                      READY     STATUS    RESTARTS   AGE
bootstrap-autoapprover-7dcd664dbf-9xqf8   1/1       Running   10         4h
master-api-bf777d955-f2c54                2/2       Running   0          4h
master-controllers-64b5f48fbc-wprrx       1/1       Running   0          4h
master-etcd-54f698dc5d-v2blb              1/1       Running   0          4h
servicecatalog-api-798569dfdf-z5pqn       1/1       Running   10         4h

Any ideas why this is failing?

etcd operator needs cluster role binding

If we want to run one operator per customer (one operator managing one etcd cluster), we will need to create a ClusterRole and ClusterRoleBinding for that operator's service account (or the default one).

This implies that we:

  1. Will need a cluster-admin kubeconfig to do so.
  2. Will have to know the HCP namespace in the config when generating it.

If we ran one global operator instead, we would still need the same, but it would be a one-time setup for all HCP clusters. Ideas?

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: etcd-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcd-operator
subjects:
- kind: ServiceAccount
  name: default
  namespace: "{{ .Config.Namespace }}"

This assumes the ClusterRole etcd-operator has already been created; a sketch of such a role follows.
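
For reference, a minimal sketch of such a ClusterRole, loosely based on the upstream etcd-operator example RBAC; the rules should be verified against the operator version we actually deploy:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: etcd-operator
rules:
- apiGroups: ["etcd.database.coreos.com"]
  resources: ["etcdclusters", "etcdbackups", "etcdrestores"]
  verbs: ["*"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["pods", "services", "endpoints", "persistentvolumeclaims", "events"]
  verbs: ["*"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]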

[Testing] testing requirements

For full CI/CD we will need to enhance our testing and image resolution capabilities.

We have a few types of OKD-related testing:

  1. Images built as part of the core origin repo (all in the control plane).

  2. Images built from individual repositories (all in the addons/plugins part).

  3. Any commit to the origin repo will need to trigger a job that overwrites ONLY the control plane images (list: https://github.com/openshift/release/blob/master/ci-operator/config/openshift/origin/master.json).
    This means that, for our subset, the images in [1] will have to point to oreg_url:=testregistry/example/ci-pr-as-${component}-${version},
    and all other images will have to point to the upstream registry (registry.redhat.io for GA-build tests and the AWS registry for pre-GA builds).

[1]:

c.ControlPlaneImage = image(c.ImageConfigFormat, "control-plane", v)
c.NodeImage = image(c.ImageConfigFormat, "node", v)
c.RouterImage = image(c.ImageConfigFormat, "haproxy-router", v)

This implies that for images like [2] we will need to override them with a dynamic registry variable (either GA or pre-GA AWS). I suggest adding back the DEV_REGISTRY variable: for 3.11, all other images point to it; for GA, they point to registry.redhat.io.

[2]:

c.ServiceCatalogImage = image(c.ImageConfigFormat, "service-catalog", v)
c.AnsibleServiceBrokerImage = image(c.ImageConfigFormat, "ansible-service-broker", v)
c.TemplateServiceBrokerImage = image(c.ImageConfigFormat, "template-service-broker", v)
c.RegistryImage = image(c.ImageConfigFormat, "docker-registry", v)

The second set of jobs has to point to the individual repositories for images like [3] and test them in our future E2E tests by:

  1. Installing either a GA or pre-GA stable release cluster.
  2. Injecting the new images we need to test into the sync plugin config, overwriting the individual image under test.
  3. Profit.

[3]

c.OAuthProxyImage = "registry.access.redhat.com/openshift3/oauth-proxy:" + v
c.PrometheusImage = "registry.access.redhat.com/openshift3/prometheus:" + v
c.PrometheusAlertBufferImage = "registry.access.redhat.com/openshift3/prometheus-alert-buffer:" + v
c.PrometheusAlertManagerImage = "registry.access.redhat.com/openshift3/prometheus-alertmanager:" + v
c.PrometheusNodeExporterImage = "registry.access.redhat.com/openshift3/prometheus-node-exporter:" + v

For the first step, in the next few weeks (post-GA), we need to rearrange
https://github.com/openshift/openshift-azure/blob/master/pkg/config/generate.go#L87-L90
so that it is split between core images and individual images, and introduce the DEV_REGISTRY variable into the equation.
Second, individual ci-operator jobs need to be created.
Third, an e2e test suite entry point needs to be created for:

  1. Run unit tests
  2. Create cluster
  3. Run integration tests
  4. Run smoke tests
  5. Any other tests?

cc: @openshift/sig-azure for discussion?

vm-compute-0 can't contact apiserver

Reproducer: run oc new-app nginx; the deploy pod fails because it cannot contact the apiserver. On the node, curl https://172.31.0.1 hangs. This doesn't seem to happen on vm-infra-0.

Testing pre-release with AKS/HCP

Currently the AKS model we use can't pull from our pre-release registry in AWS:

  1m            1m              1       default-scheduler                                                       Normal          Scheduled               Successfully assigned master-controllers-75c58bd78-cw4s8 to aks-agentpool-94987606-0
  1m            1m              1       kubelet, aks-agentpool-94987606-0                                       Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "default-token-h2l5b"
  1m            1m              1       kubelet, aks-agentpool-94987606-0                                       Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "master-cloud-provider"
  1m            1m              1       kubelet, aks-agentpool-94987606-0                                       Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "master-config"    
  1m            30s             3       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Normal          Pulling                 pulling image "openshift3/ose-control-plane:v3.10.15-1"   
  1m            30s             3       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Warning         Failed                  Failed to pull image "openshift3/ose-control-plane:v3.10.15-1": rpc error: code = Unknown desc = Error response from daemon: repository openshift3/ose-control-plane not found: does not exist or no pull access                                                             
  1m            30s             3       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Warning         Failed                  Error: ErrImagePull                                       
  1m            8s              4       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Normal          BackOff                 Back-off pulling image "openshift3/ose-control-plane:v3.10.15-1"
  1m            8s              4       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Warning         Failed                  Error: ImagePullBackOff

So testing can't be done with pre-release RHEL.

We need a way to inject our pull secrets and define pre-release images with a full FQDN when testing RHEL :/

create docker registry init container to set up storage account

Currently fails to deploy

panic: azure: malformed storage account key: illegal base64 data at input byte 4

goroutine 1 [running]:
github.com/openshift/image-registry/vendor/github.com/docker/distribution/registry/handlers.NewApp(0x7f25be9e0c40, 0x2215860, 0xc420401880, 0x7f25be9e0c40)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/vendor/github.com/docker/distribution/registry/handlers/app.go:123 +0x341e
github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp(0x218cf00, 0x2215860, 0xc420401880, 0x218cc00, 0xc420203420, 0xc4205158b0)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:89 +0xae
github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp(0x218cf00, 0x2215860, 0x2178140, 0xc4201421a0, 0xc420401880, 0xc420644960, 0x0, 0x0, 0xd7, 0x19b)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:107 +0x328
github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer(0x218cf00, 0x2215860, 0xc420401880, 0xc420644960, 0x0, 0x0, 0x21953e0)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:186 +0x19c
github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute(0x2173780, 0xc420142030)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:161 +0x9ed
main.main()
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:54 +0x37e

sync process shouldn't call Enrich

Enrich is a fix-up function. There should probably be fields in the external API representation for everything that Enrich adds, so that the function itself can be removed. The enriched external API representation can be passed over to the sync pod, and it then shouldn't need environment variables like RESOURCEGROUP to be set.

Upgrade Services fails

2018/07/20 20:29:30 Delete clusterIP, map[annotations:map[prometheus.io/port:1936 prometheus.io/scrape:true prometheus.openshift.io/password:75XmZa2KNV prometheus.openshift.io/username:admin] name:router-stats namespace:default selfLink:/api/v1/namespaces/default/services/router-stats uid:d8328c46-8c4c-11e8-b835-c2b37693715e resourceVersion:2792 creationTimestamp:2018-07-20T18:43:50Z labels:map[router:router-stats]]
2018/07/20 20:29:30 Update Service/default/router-stats
2018/07/20 20:29:30 - Object.map[metadata].map[annotations].map[prometheus.openshift.io/password]: 75XmZa2KNV != vgIdDy3bPG
2018/07/20 20:29:30 Error while syncing: Service "router-stats" is invalid: spec.clusterIP: Invalid value: "": field is immutable
+ KUBECONFIG=_data/_out/admin.kubeconfig

cc @jim-minter: it looks like clusterIP handling is needed not only for type LoadBalancer.

Consider using k8s deployments vs DCs for infra components

default                            docker-registry-1-deploy         0/1       Error              0          1h
default                            registry-console-1-deploy        0/1       Error              0          1h
default                            router-1-deploy                  0/1       Error              0          1h
openshift-ansible-service-broker   asb-1-deploy                     0/1       Error              0          1h

It's unfortunate watching our infra deployment configs fail because the deployer pod never managed to run or failed. These processes do not need the extra APIs provided by DCs and should retry indefinitely.
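
For illustration, a minimal sketch of what one of these components could look like as a plain Kubernetes Deployment instead of a DeploymentConfig (the name, labels and image below are placeholders, not our actual manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-registry
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-registry
  template:
    metadata:
      labels:
        app: docker-registry
    spec:
      containers:
      - name: registry
        image: registry-image:placeholder   # placeholder image reference
        ports:
        - containerPort: 5000

A Deployment is reconciled directly by the controller manager, so there is no one-shot deployer pod to fail and the rollout keeps being retried by the controller.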

etcd operator storage, crd and certificate issues

The plan was to use etcd-operator; here is where we struggle.

First, we need to know the namespace when generating certificates. This is because of
https://github.com/coreos/etcd-operator/blob/70d3bd74960dc7127870a393affffbe1df94728e/pkg/util/etcdutil/member.go#L38-L40
The result is that etcd advertises itself with name.namespace.svc and we need to have this in the certificates.

Second (and somewhat bigger) is storage.
The etcd-operator docs and examples online are misleading in places, so we rely on the code.

  1. The etcd pods themselves do not have any persistence.
    https://github.com/coreos/etcd-operator/blob/master/pkg/apis/etcd/v1beta2/cluster.go#L137
    Upstream issue:
    coreos/etcd-operator#1323

The idea is that we run in memory and back up constantly. In a DR situation, if a single pod is alive, the operator will recover. If all pods restart, recovery is done with the etcd-restore-operator, restoring from backup.

For this we need the etcd-backup and etcd-restore operators.
The backup operator supports two backup methods (Azure ABS and AWS S3): https://github.com/coreos/etcd-operator/blob/master/pkg/apis/etcd/v1beta2/backup_types.go#L19-L28

Configuration is what causes an issue: we need a secret containing the storage account name and key.
https://github.com/coreos/etcd-operator/blob/master/doc/design/abs_backup.md

This means the prerequisites are:

  1. Storage account created.
  2. Key available during creation of secret.

We don't want to create a storage account during ARM deployment, as it is not a client-facing configuration artifact. We could use one storage account with multiple buckets per customer and inject it from the backend.
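
As a sketch of what the backend would need to inject (the secret name and key names below are assumptions made for illustration; check doc/design/abs_backup.md for the exact format the backup operator expects):

# Hypothetical: create the ABS credentials secret in the customer's HCP namespace
kubectl -n "$HCP_NAMESPACE" create secret generic etcd-backup-azure-secret \
  --from-literal=storage-account="$STORAGE_ACCOUNT_NAME" \
  --from-literal=storage-key="$STORAGE_ACCOUNT_KEY"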

The last issue is Helm ordering for CRDs:
helm/helm#2994
TL;DR: when Helm creates a CRD, it takes some time for the cluster to accept it, so creating resources of that CRD type fails because it is not yet available.

In addition, we don't want to manage global CRDs for all users from the user-configuration side. If a CRD is deleted, all etcd clusters are deleted too. It looks like we need to manage them outside azure-helm, as part of HCP management.

cc: @jim-minter @Kargakis @pweil-
