
openshift-azure


Prerequisites

Note that this README is targeted at AOS-Azure contributors. If you are not a member of this team, these instructions may not work, as they assume permissions you may not have.

  1. Utilities. Install the following (a quick sanity-check sketch for steps 1-4 follows this list):

    1. Golang 1.11.6 (can also use package manager)
    2. Latest Azure CLI
    3. OpenShift Origin 3.11 client tools (can also use package manager)
    4. Latest Glide. Note: Glide 0.13.1 is known to be broken.
    5. jq (can also use package manager)

    Development helper scripts assume an up-to-date GNU tools environment. Recent Linux distros should work out-of-the-box.

    macOS ships with outdated BSD-based tools. We recommend installing macOS GNU tools.

  2. Environment variables. Ensure that $GOPATH/bin is in your path:

    export PATH=$PATH:${GOPATH:-$HOME/go}/bin

  3. Azure CLI access. Log in to Azure with the CLI by running az login and entering your credentials.

  4. OpenShift CI cluster access. Log in to the CI cluster using oc login and a token from the CI cluster web interface. You can copy the required command by clicking on your username and the "Copy Login Command" option in the web portal.

  5. Codebase. Check out the codebase:

    go get github.com/openshift/openshift-azure/...

  6. Secrets. Retrieve cluster creation secrets from the vault:

    export VAULT_ADDR=https://vault.ci.openshift.org
    ./vault login $TOKEN_FROM_THE_VAULT
    ./vault kv get -format=json "kv/selfservice/azure/cluster-secrets-azure/" | jq ".data.data"  > vault-secrets.json
    python3 vault-secrets.py
    
  7. Environment file. Create an environment file:

    cp env.example env

  8. AAD Application / Service principal. Create a personal AAD Application:

    1. hack/aad.sh app-create user-$USER-aad aro-team-shared
    2. Update env to include the AZURE_AAD_CLIENT_ID and AZURE_AAD_CLIENT_SECRET values output by aad.sh.
    3. Ask an AAD administrator to grant permissions to your application.
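
If you want to sanity-check the prerequisites before continuing, something like the following should work (a rough sketch; exact version output will vary):

go version        # expect go1.11.x
az --version
oc version
glide --version   # anything except 0.13.1
jq --version
az account show   # confirms "az login" succeeded
oc whoami         # confirms "oc login" against the CI cluster succeeded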

Deploy an OpenShift cluster

  1. Source the env file: . ./env

  2. Determine an appropriate resource group name for your cluster (e.g. for a test cluster, you could call it $USER-test). Then export RESOURCEGROUP and run ./hack/create.sh $RESOURCEGROUP to deploy a cluster. A consolidated example follows this list.

  3. Access the web console via the link printed by create.sh, logging in with your Azure credentials.

  4. To inspect pods running on the OpenShift cluster, run KUBECONFIG=_data/_out/admin.kubeconfig oc get pods.

  5. To ssh into any OpenShift master node, run ./hack/ssh.sh. You can ssh directly to any other host from the master; sudo -i will give you a root shell.

  6. Run ./hack/delete.sh to delete the deployed cluster.
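
Putting the steps above together, a typical test-cluster session looks roughly like this (a sketch; the resource group name is only an example):

. ./env
export RESOURCEGROUP=$USER-test
./hack/create.sh $RESOURCEGROUP

# inspect the running cluster
KUBECONFIG=_data/_out/admin.kubeconfig oc get pods
./hack/ssh.sh

# tear everything down when finished
./hack/delete.sh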

Examples

Basic OpenShift configuration (also see test/manifests/fakerp/create.yaml):

name: openshift
location: $AZURE_REGION
properties:
  openShiftVersion: v3.11
  authProfile:
    identityProviders:
    - name: Azure AD
      provider:
        kind: AADIdentityProvider
        clientId: $AZURE_AAD_CLIENT_ID
        secret: $AZURE_AAD_CLIENT_SECRET
        tenantId: $AZURE_TENANT_ID
  networkProfile:
    vnetCidr: 10.0.0.0/8
  masterPoolProfile:
    count: 3
    vmSize: Standard_D2s_v3
    subnetCidr: 10.0.0.0/24
  agentPoolProfiles:
  - name: infra
    role: infra
    count: 3
    vmSize: Standard_D2s_v3
    subnetCidr: 10.0.0.0/24
    osType: Linux
  - name: compute
    role: compute
    count: 1
    vmSize: Standard_D2s_v3
    subnetCidr: 10.0.0.0/24
    osType: Linux

CI infrastructure

Read more about how to work with our CI system here.

For any infrastructure-related issues, contact the Developer Productivity team, which is responsible for managing the OpenShift CI infrastructure, in the #forum-testplatform Slack channel.

Contributors

0xmichalis, asalkeld, charlesakalugwu, ehashman, gburges, gvanderpotte, hawkowl, jaormx, jim-minter, m1kola, makdaam, mgahagan73, mjudeikis, nilsanderselde, olga-mir, olga-mirensky, openshift-azure-robot, openshift-merge-robot, pweil-, sjkingo, sozercan, thekad, troy0820, yhcote


Issues

router fails to deploy

I0626 13:03:24.673908       1 template.go:244] Starting template router (v3.10.0-rc.0+e4d22b0-17)
I0626 13:03:24.675639       1 metrics.go:159] Router health and metrics port listening at 0.0.0.0:1936
E0626 13:03:24.682866       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0626 13:03:24.705162       1 router.go:252] Router is including routes in all namespaces
E0626 13:03:24.918625       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
E0626 13:03:24.941509       1 limiter.go:137] error reloading router: exit status 1
[ALERT] 176/130324 (30) : Starting frontend public_ssl: cannot bind socket [0.0.0.0:443]

router stats scrape fails

But the router comes up ok.

I0628 14:47:19.452212       1 template.go:244] Starting template router (v3.10.0-rc.0+9453255-21)
I0628 14:47:19.455144       1 metrics.go:159] Router health and metrics port listening at 0.0.0.0:1936
E0628 14:47:19.462591       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0628 14:47:19.530212       1 router.go:454] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0628 14:47:19.530318       1 router.go:252] Router is including routes in all namespaces
I0628 14:47:19.793799       1 router.go:454] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

Error while running sync: cannot sync cluster config

During a cluster deploy I saw the following error:

2018/07/27 13:54:10 Error while syncing: cannot sync cluster config: serviceaccounts "default-rolebindings-controller" already exists
+ go run cmd/sync/sync.go -run-once=true
2018/07/27 13:50:31 Sync process started!
2018/07/27 13:54:06 Update Namespace/default
2018/07/27 13:54:06 - Object.map[spec].map[finalizers].slice[1]: <no value> != openshift.io/origin
2018/07/27 13:54:06 - Object.map[metadata].map[annotations]: <does not have key> != map[openshift.io/node-selector:]
2018/07/27 13:54:06 Skip Namespace/kube-public
2018/07/27 13:54:06 Create Namespace/kube-service-catalog
2018/07/27 13:54:06 Skip Namespace/kube-system
2018/07/27 13:54:06 Update Namespace/openshift
2018/07/27 13:54:06 - Object.map[spec].map[finalizers].slice[1]: <no value> != openshift.io/origin
2018/07/27 13:54:06 Create Namespace/openshift-ansible-service-broker
2018/07/27 13:54:06 Create Namespace/openshift-azure
2018/07/27 13:54:06 Skip Namespace/openshift-infra
2018/07/27 13:54:06 Create Namespace/openshift-metrics
2018/07/27 13:54:06 Update Namespace/openshift-node
2018/07/27 13:54:06 - Object.map[metadata].map[annotations]: <does not have key> != map[openshift.io/node-selector:]
2018/07/27 13:54:07 Create Namespace/openshift-sdn
2018/07/27 13:54:07 Create Namespace/openshift-template-service-broker
2018/07/27 13:54:07 Create Namespace/openshift-web-console
2018/07/27 13:54:07 Create ServiceAccount/default/default
2018/07/27 13:54:07 Create ServiceAccount/default/registry
2018/07/27 13:54:07 Create ServiceAccount/default/router
2018/07/27 13:54:07 Create ServiceAccount/kube-service-catalog/service-catalog-apiserver
2018/07/27 13:54:07 Create ServiceAccount/kube-service-catalog/service-catalog-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/attachdetach-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/certificate-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/clusterrole-aggregation-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/cronjob-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/daemon-set-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/deployment-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/disruption-controller
2018/07/27 13:54:08 Create ServiceAccount/kube-system/endpoint-controller
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/generic-garbage-collector
2018/07/27 13:54:08 Skip ServiceAccount/kube-system/job-controller
2018/07/27 13:54:09 Create ServiceAccount/kube-system/namespace-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/node-controller
2018/07/27 13:54:09 Create ServiceAccount/kube-system/persistent-volume-binder
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/pod-garbage-collector
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/pv-protection-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/pvc-protection-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/replicaset-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/replication-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/resourcequota-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/service-account-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/service-controller
2018/07/27 13:54:09 Skip ServiceAccount/kube-system/statefulset-controller
2018/07/27 13:54:09 Create ServiceAccount/openshift-ansible-service-broker/asb
2018/07/27 13:54:09 Create ServiceAccount/openshift-ansible-service-broker/asb-client
2018/07/27 13:54:09 Create ServiceAccount/openshift-infra/bootstrap-autoapprover
2018/07/27 13:54:09 Skip ServiceAccount/openshift-infra/build-config-change-controller
2018/07/27 13:54:10 Skip ServiceAccount/openshift-infra/build-controller
2018/07/27 13:54:10 Skip ServiceAccount/openshift-infra/cluster-quota-reconciliation-controller
2018/07/27 13:54:10 Create ServiceAccount/openshift-infra/default-rolebindings-controller
2018/07/27 13:54:10 Error while syncing: cannot sync cluster config: serviceaccounts "default-rolebindings-controller" already exists
+ az group deployment wait -g kwoodsontest -n azuredeploy --created --interval 10
+ KUBECONFIG=_data/_out/admin.kubeconfig
+ go run cmd/healthcheck/healthcheck.go
panic: unexpected error code 403 from console

goroutine 1 [running]:
main.main()
	/home/kwoodson/go/src/github.com/openshift/openshift-azure/cmd/healthcheck/healthcheck.go:53 +0x44
exit status 2

clean prometheus config

Prometheus currently has a default config that scrapes all components, including the control plane. We need to strip it down.
screenshot from 2018-07-17 12-08-09

Directory permission is needed for the current user to register the application

Directory permission is needed for the current user to register the application. For how to configure, please refer 'https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal'. Original error: Insufficient privileges to complete the operation.
+ az ad sp create --id
az ad sp create: error: argument --id: expected one argument

@jim-minter

oc logs does not work

$ oc logs -f dc/router
Error from server: Get https://vm-infra-0:10250/containerLogs/default/router-1-deploy/deployment?follow=true: dial tcp: lookup vm-infra-0 on 172.17.0.1:53: no such host

kubernetes service deployed in the ccp is broken

error: couldn't get deployment docker-registry-1: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/docker-registry-1: dial tcp 172.30.0.1:443: i/o timeout

Likely needs some iptables magic

sync process does not handle errors gracefully

The sync process panics if DNS is not resolvable:

+ go run cmd/sync/sync.go
panic: Get https://openshift.mjudeikis-etcd.osadev.cloud/healthz: dial tcp: lookup openshift.mjudeikis-etcd.osadev.cloud on 127.0.0.1:53: no such host

goroutine 1 [running]:
main.main()
        /home/mjudeiki/go/src/github.com/jim-minter/azure-helm/cmd/sync/sync.go:42 +0x1fb
exit status 2

cc: @jim-minter , this is what I was telling you on Friday :)

refactor validate

We need to rebase the external API and refactor validate() to return []error, to make validation more testable. A sketch of the proposed shape follows.
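
For illustration only, a minimal sketch of the proposed shape; the OpenShiftCluster type and its fields below are placeholders, not the actual external API:

package api

import "fmt"

// OpenShiftCluster is a placeholder standing in for the external API type.
type OpenShiftCluster struct {
    Name        string
    Location    string
    MasterCount int
}

// validate collects every problem instead of stopping at the first one,
// which makes table-driven unit tests straightforward.
func validate(cs *OpenShiftCluster) (errs []error) {
    if cs.Name == "" {
        errs = append(errs, fmt.Errorf("name must not be empty"))
    }
    if cs.Location == "" {
        errs = append(errs, fmt.Errorf("location must not be empty"))
    }
    if cs.MasterCount != 3 {
        errs = append(errs, fmt.Errorf("invalid masterCount %d: must be 3", cs.MasterCount))
    }
    return errs
}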

service catalog does not have enough access to run

I0625 13:48:41.747136       1 feature_gate.go:190] feature gates: map[OriginatingIdentity:true]
I0625 13:48:41.747270       1 hyperkube.go:192] Service Catalog version v3.10.0-rc.0+8d6748f-dirty (built 2018-06-23T19:27:11Z)
W0625 13:48:42.314600       1 authentication.go:232] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:hcp:default" cannot get configmaps in the namespace "kube-system": User "system:serviceaccount:hcp:default" cannot get configmaps in project "kube-system"

The maximum number of data disks allowed to be attached to a VM of this size is 4.

29m         29m          1         master-etcd-865689d86b-zx8hk.15427899f21d6546              Pod                                                    Warning   FailedAttachVolume      attachdetach-controller             AttachVolume.Attach failed for volume "pvc-bbcfe918-8a8b-11e8-9437-0a58ac1f0b47" : Attach volume "kubernetes-dynamic-pvc-bbcfe918-8a8b-11e8-9437-0a58ac1f0b47" to instance "/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/MC_aks_aks_eastus/providers/Microsoft.Compute/virtualMachines/aks-agentpool-35064155-0" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="InvalidParameter" Message="A disk at LUN 3 already exists."
19m         28m          5         master-etcd-865689d86b-zx8hk.154278a627a75eff              Pod                                                    Warning   FailedMount             kubelet, aks-agentpool-35064155-0   Unable to mount volumes for pod "master-etcd-865689d86b-zx8hk_kargakis-test(bc637f14-8a8b-11e8-9437-0a58ac1f0b47)": timeout expired waiting for volumes to attach or mount for pod "kargakis-test"/"master-etcd-865689d86b-zx8hk". list of unmounted volumes=[master-etcd]. list of unattached volumes=[master-config master-etcd default-token-b9jln]

etcd cannot get deployed; cluster is broken

We probably want to revert #65

@jim-minter @mjudeikis @kwoodson

Move the service catalog controller onto the HCP

As discussed in scrum today

We will also want to clean up both daemon sets from the ccp

$ oc get ds
NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                         AGE
apiserver            0         0         0         0            0           node-role.kubernetes.io/master=true   2h
controller-manager   0         0         0         0            0           node-role.kubernetes.io/master=true   2h

osaapi branch vendoring broken

The current Glide config does not resolve dependencies correctly.
Branch: externalapi

pkg/addons/addons.go:162:106: cannot use "github.com/jim-minter/azure-helm/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions literal (type "github.com/jim-minter/azure-helm/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions) as type "github.com/jim-minter/azure-helm/vendor/k8s.io/kube-aggregator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1".GetOptions in argument to ac.ApiregistrationV1().APIServices().Get
pkg/addons/addons.go:286:36: cannot use restconfig (type *"github.com/jim-minter/azure-helm/vendor/k8s.io/client-go/rest".Config) as type *"github.com/jim-minter/azure-helm/vendor/k8s.io/kube-aggregator/vendor/k8s.io/client-go/rest".Config in argument to clientset.NewForConfig

healthcheck panic

Saw this in #114 but it seems applicable to master too.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd1a585]

goroutine 1 [running]:
github.com/openshift/openshift-azure/pkg/checks.WaitForHTTPStatusOk.func1(0xc42009e780, 0xc420535bef, 0x809801)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/checks/checks.go:48 +0x145
github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait.WaitFor(0xc4203e52c0, 0xc4204e8da0, 0x0, 0x1, 0xc4204e8da0)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:311 +0x6b
github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait.PollUntil(0x3b9aca00, 0xc4204e8da0, 0x0, 0x39, 0x0)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:290 +0x56
github.com/openshift/openshift-azure/pkg/checks.WaitForHTTPStatusOk(0x1194c20, 0xc42003e130, 0x117e9e0, 0xc420234690, 0xc420556c40, 0x39, 0x39, 0x40)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/checks/checks.go:31 +0x147
github.com/openshift/openshift-azure/pkg/addons.(*client).WaitForHealthz(0xc4201cf080, 0xc4201cf080, 0x0)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/addons/client.go:93 +0xf2
github.com/openshift/openshift-azure/pkg/addons.newClient(0x0, 0x16, 0xc4204b7dc0, 0x52f945, 0xee4e40)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/addons/client.go:76 +0x20a
github.com/openshift/openshift-azure/pkg/addons.Main(0xc42041d9e0, 0xc42003b680, 0x1df00, 0xee4e40, 0xc42000ee20)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/pkg/addons/addons.go:204 +0x42
main.sync(0xe9561b, 0x1091814)
	/home/kargakis/go/src/github.com/openshift/openshift-azure/cmd/sync/sync.go:56 +0x24a
main.main()
	/home/kargakis/go/src/github.com/openshift/openshift-azure/cmd/sync/sync.go:68 +0x57
exit status 2

/kind bug

@kwoodson @jim-minter

invalid SC certificate

Seeing the following in the core apiserver log

logging error output: "Error: 'x509: certificate is valid for servicecatalog-api, not apiserver.kube-service-catalog.svc'\nTrying to reach: 'https://servicecatalog-api:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s'"

Unclear how safe it is to add the CCP service name in the SC cert but opening a PR anyway to test and discuss.

@jim-minter @pweil- @liggitt

nodes can't "speak" to each other

Prometheus is running on infra and trying to scrape the nodes (or vice versa).

[root@infra-000000 ~]# curl http://10.0.0.5:9100/metrics
curl: (7) Failed connect to 10.0.0.5:9100; No route to host
[root@infra-000000 ~]# ping 10.0.0.5
PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.747 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.781 ms

screenshot from 2018-07-17 14-07-53

Cannot sign in to the web console using AAD

Request Id: cf18d201-c73f-4f1d-92fa-25b06c002200
Correlation Id: 1f095342-9f41-4703-8be7-a1556c1b9b8c
Timestamp: 2018-07-03T09:10:15Z
Message: AADSTS50011: The reply url specified in the request does not match the reply urls configured for the application: 'CLIENT_IDXXXXXXXXXXXXX'. 

API server log:

I0703 09:10:14.902183       1 handler.go:66] Authentication needed for {Azure AD 0xac33be0 {XXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXX [openid] map[] https://login.microsoftonline.com/XXXXXXXXXXXXXXXXXXXXXXX/oauth2/authorize https://login.microsoftonline.com/XXXXXXXXXXXXXXXXXXXXXXX/oauth2/token  [sub] [unique_name] [email] [name] <nil>}}
I0703 09:10:14.902461       1 handler.go:78] redirect to https://login.microsoftonline.com/XXXXXXXXXXXXXXXXXXXXXXX/oauth2/authorize?client_id=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&redirect_uri=https%3A%2F%2Fopenshift.mkargaki-hcp.osadev.cloud%2Foauth2callback%2FAzure+AD&response_type=code&scope=openid&state=Y3NyZj1kODY3NDYxZC03ZTlmLTExZTgtYmU2NS0xYWViMWUzMjgyNjgmdGhlbj0lMkZvYXV0aCUyRmF1dGhvcml6ZSUzRmNsaWVudF9pZCUzRG9wZW5zaGlmdC13ZWItY29uc29sZSUyNmlkcCUzREF6dXJlJTJCQUQlMjZyZWRpcmVjdF91cmklM0RodHRwcyUyNTNBJTI1MkYlMjUyRm9wZW5zaGlmdC5ta2FyZ2FraS1oY3Aub3NhZGV2LmNsb3VkJTI1MkZjb25zb2xlJTI1MkZvYXV0aCUyNnJlc3BvbnNlX3R5cGUlM0Rjb2RlJTI2c3RhdGUlM0RleUowYUdWdUlqb2lMMk5oZEdGc2IyY2lMQ0p1YjI1alpTSTZJakUxTXpBMk1EZzFOakV4TURZdE16QXpOVFF4T1RreE1USTBOVGcyTXpBMU9EY3pNVFkzTXpNMk5qQXdNalkyTURrNU9EVXhNREl3TkRrMU5qY3lOak14TVRBNE1UVTJNelU0TVRRM09EQTNNRFkwTWpNNU9UQTBPVEkyTVRJaWZR
I0703 09:10:14.903042       1 wrap.go:42] GET /oauth/authorize?client_id=openshift-web-console&idp=Azure+AD&redirect_uri=https%3A%2F%2Fopenshift.mkargaki-hcp.osadev.cloud%2Fconsole%2Foauth&response_type=code&state=eyJ0aGVuIjoiL2NhdGFsb2ciLCJub25jZSI6IjE1MzA2MDg1NjExMDYtMzAzNTQxOTkxMTI0NTg2MzA1ODczMTY3MzM2NjAwMjY2MDk5ODUxMDIwNDk1NjcyNjMxMTA4MTU2MzU4MTQ3ODA3MDY0MjM5OTA0OTI2MTIifQ: (47.525557ms) 302 [[Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0] 10.244.0.9:51326]

Node role label missing from master nodes

$ oc get no master-000000 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-07-30T10:32:41Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: Standard_D2s_v3
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eastus
    failure-domain.beta.kubernetes.io/zone: "0"
    kubernetes.io/hostname: master-000000
  name: master-000000
<-- snip -->

whereas it's set correctly on infra and compute nodes

$ oc get no infra-000000 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-07-30T10:35:49Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: Standard_D2s_v3
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eastus
    failure-domain.beta.kubernetes.io/zone: "0"
    kubernetes.io/hostname: infra-000000
    node-role.kubernetes.io/infra: "true"
    region: infra
  name: infra-000000
<-- snip -->
$ oc get no compute-000000 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-07-30T10:35:49Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: Standard_D2s_v3
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eastus
    failure-domain.beta.kubernetes.io/zone: "0"
    kubernetes.io/hostname: compute-000000
    node-role.kubernetes.io/compute: "true"
    region: primary
  name: compute-000000
<-- snip -->

/kind bug

panic: couldn't find kind ClusterServiceBroker

2018/06/27 18:44:05 Create Secret/openshift-template-service-broker/templateservicebroker-client
panic: couldn't find kind ClusterServiceBroker

goroutine 1 [running]:
main.main()
	/home/kargakis/go/src/github.com/jim-minter/azure-helm/cmd/sync/sync.go:37 +0x1e8
exit status 2

This happened during the addons installation.

unable to build cakephp-mysql

After finishing a recent deploy, I attempted to build a cakephp-mysql application from the console. It failed with the following build errors:

The ImageStreamTag "php:7.0" is invalid: from: Error resolving ImageStreamTag php:7.0 in namespace test: imagestreams.image.openshift.io "php" not found

I then did an image import:

$ KUBECONFIG=_data/_out/admin.kubeconfig oc import-image --from=docker.io/library/php:7.0 php:7.0 --confirm
The import completed successfully.

Name:			php
Namespace:		test
Created:		Less than a second ago
Labels:			<none>
Annotations:		openshift.io/image.dockerRepositoryCheck=2018-07-17T18:48:55Z
Docker Pull Spec:	docker-registry.default.svc:5000/test/php
Image Lookup:		local=false
Unique Images:		1
Tags:			1

7.0
  tagged from docker.io/library/php:7.0
...

Then I attempted to start a build:

$ KUBECONFIG=_data/_out/admin.kubeconfig oc start-build cakephp-mysql-persistent          
build "cakephp-mysql-persistent-2" started
$ KUBECONFIG=_data/_out/admin.kubeconfig oc get builds           
NAME                         TYPE      FROM          STATUS                            STARTED              DURATION
cakephp-mysql-persistent-1   Source    Git@53d2216   Failed (PullBuilderImageFailed)   About a minute ago   40s
cakephp-mysql-persistent-2   Source    Git@53d2216   Failed (PullBuilderImageFailed)   32 seconds ago       5s
$ KUBECONFIG=_data/_out/admin.kubeconfig oc get pods
NAME                               READY     STATUS    RESTARTS   AGE
cakephp-mysql-persistent-1-build   0/1       Error     0          1m
cakephp-mysql-persistent-2-build   0/1       Error     0          46s

Fetched the logs:

$ KUBECONFIG=_data/_out/admin.kubeconfig oc logs cakephp-mysql-persistent-2-build
Error from server: no preferred addresses found; known addresses: [{Hostname compute-000000}]

Nodes are up and running:

NAME             STATUS    AGE       VERSION
compute-000000   Ready     4h        v1.10.0+b81c8f8
infra-000000     Ready     4h        v1.10.0+b81c8f8

Pods for this cluster are healthy:

NAME                                      READY     STATUS    RESTARTS   AGE
bootstrap-autoapprover-7dcd664dbf-9xqf8   1/1       Running   10         4h
master-api-bf777d955-f2c54                2/2       Running   0          4h
master-controllers-64b5f48fbc-wprrx       1/1       Running   0          4h
master-etcd-54f698dc5d-v2blb              1/1       Running   0          4h
servicecatalog-api-798569dfdf-z5pqn       1/1       Running   10         4h

Any ideas why this is failing?

etcd operator needs cluster role binding

If we want to run one operator per customer (one operator managing one etcd cluster), we will need to create a ClusterRole and ClusterRoleBinding for that operator's service account (or the default one).

This implies that we:

  1. Will need a cluster-admin kubeconfig to do so.
  2. Will have to know the HCP namespace in the config when generating it.

If we ran one global operator instead, we would still need the same, but it would be a one-time setup for all HCP clusters. Ideas?

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: etcd-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcd-operator
subjects:
- kind: ServiceAccount
  name: default
  namespace: "{{ .Config.Namespace }}"

This assumes the ClusterRole etcd-operator has already been created; a sketch of such a role follows.
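
For reference, a minimal sketch of such a ClusterRole, loosely based on the upstream etcd-operator example RBAC; the rules should be verified against the operator version we actually deploy:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: etcd-operator
rules:
- apiGroups: ["etcd.database.coreos.com"]
  resources: ["etcdclusters", "etcdbackups", "etcdrestores"]
  verbs: ["*"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["pods", "services", "endpoints", "persistentvolumeclaims", "events"]
  verbs: ["*"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]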

[Testing] testing requirements

For full CI/CD we will need to enhance our testing and image resolution capabilities.

We have a few types of OKD-related testing:

  1. Images built as part of the core origin repo (all in the control plane).

  2. Images built from individual repositories (all in the addons/plugins part).

  3. Any commit to the origin repo will need to trigger a job that overwrites ONLY the control plane images (list: https://github.com/openshift/release/blob/master/ci-operator/config/openshift/origin/master.json).
    This means that, for our subset, the images in [1] will have to point to oreg_url:=testregistry/example/ci-pr-as-${component}-${version},
    and all other images will have to point to the upstream registry (registry.redhat.io for GA-build tests and the AWS registry for pre-GA builds).

[1]:

c.ControlPlaneImage = image(c.ImageConfigFormat, "control-plane", v)
c.NodeImage = image(c.ImageConfigFormat, "node", v)
c.RouterImage = image(c.ImageConfigFormat, "haproxy-router", v)

This implies that for images like [2] we will need to override them with a dynamic registry variable (either GA or pre-GA AWS). I suggest adding back the DEV_REGISTRY variable: for 3.11, all other images point to it; for GA, they point to registry.redhat.io.

[2]:

c.ServiceCatalogImage = image(c.ImageConfigFormat, "service-catalog", v)
c.AnsibleServiceBrokerImage = image(c.ImageConfigFormat, "ansible-service-broker", v)
c.TemplateServiceBrokerImage = image(c.ImageConfigFormat, "template-service-broker", v)
c.RegistryImage = image(c.ImageConfigFormat, "docker-registry", v)

The second set of jobs has to point to the individual repositories for images like [3] and test them in our future E2E tests by:

  1. Installing either a GA or pre-GA stable release cluster.
  2. Injecting the new images we need to test into the sync plugin config, overwriting the individual image under test.
  3. Profit.

[3]

c.OAuthProxyImage = "registry.access.redhat.com/openshift3/oauth-proxy:" + v
c.PrometheusImage = "registry.access.redhat.com/openshift3/prometheus:" + v
c.PrometheusAlertBufferImage = "registry.access.redhat.com/openshift3/prometheus-alert-buffer:" + v
c.PrometheusAlertManagerImage = "registry.access.redhat.com/openshift3/prometheus-alertmanager:" + v
c.PrometheusNodeExporterImage = "registry.access.redhat.com/openshift3/prometheus-node-exporter:" + v

For the first step, in the next few weeks (post-GA), we need to rearrange
https://github.com/openshift/openshift-azure/blob/master/pkg/config/generate.go#L87-L90
so that it is split between core images and individual images, and introduce the DEV_REGISTRY variable into the equation.
Second, individual ci-operator jobs need to be created.
Third, an e2e test suite entry point needs to be created for:

  1. Run unit tests
  2. Create cluster
  3. Run integration tests
  4. Run smoke tests
  5. Any other tests?

cc: @openshift/sig-azure for discussion?

vm-compute-0 can't contact apiserver

Reproducer: run oc new-app nginx; the deploy pod fails because it cannot contact the apiserver. On the node, curl https://172.31.0.1 hangs. This doesn't seem to happen on vm-infra-0.

Testing pre-release with AKS/HCP

Currently the AKS model we use can't pull from our pre-release registry in AWS:

  1m            1m              1       default-scheduler                                                       Normal          Scheduled               Successfully assigned master-controllers-75c58bd78-cw4s8 to aks-agentpool-94987606-0
  1m            1m              1       kubelet, aks-agentpool-94987606-0                                       Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "default-token-h2l5b"
  1m            1m              1       kubelet, aks-agentpool-94987606-0                                       Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "master-cloud-provider"
  1m            1m              1       kubelet, aks-agentpool-94987606-0                                       Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "master-config"    
  1m            30s             3       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Normal          Pulling                 pulling image "openshift3/ose-control-plane:v3.10.15-1"   
  1m            30s             3       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Warning         Failed                  Failed to pull image "openshift3/ose-control-plane:v3.10.15-1": rpc error: code = Unknown desc = Error response from daemon: repository openshift3/ose-control-plane not found: does not exist or no pull access                                                             
  1m            30s             3       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Warning         Failed                  Error: ErrImagePull                                       
  1m            8s              4       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Normal          BackOff                 Back-off pulling image "openshift3/ose-control-plane:v3.10.15-1"
  1m            8s              4       kubelet, aks-agentpool-94987606-0       spec.containers{controllers}    Warning         Failed                  Error: ImagePullBackOff

So testing can't be done with pre-release RHEL.

We need a way to inject our pull secrets and define pre-release images with a full FQDN when testing RHEL :/

create docker registry init container to set up storage account

Currently fails to deploy

panic: azure: malformed storage account key: illegal base64 data at input byte 4

goroutine 1 [running]:
github.com/openshift/image-registry/vendor/github.com/docker/distribution/registry/handlers.NewApp(0x7f25be9e0c40, 0x2215860, 0xc420401880, 0x7f25be9e0c40)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/vendor/github.com/docker/distribution/registry/handlers/app.go:123 +0x341e
github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp(0x218cf00, 0x2215860, 0xc420401880, 0x218cc00, 0xc420203420, 0xc4205158b0)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:89 +0xae
github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp(0x218cf00, 0x2215860, 0x2178140, 0xc4201421a0, 0xc420401880, 0xc420644960, 0x0, 0x0, 0xd7, 0x19b)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:107 +0x328
github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer(0x218cf00, 0x2215860, 0xc420401880, 0xc420644960, 0x0, 0x0, 0x21953e0)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:186 +0x19c
github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute(0x2173780, 0xc420142030)
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:161 +0x9ed
main.main()
	/tmp/openshift/build-rpms/rpm/BUILD/origin-dockerregistry-3.11.0/_output/local/go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:54 +0x37e

sync process shouldn't call Enrich

Enrich is a fix-up function. There should probably be fields in the external API representation for everything that Enrich adds, so that the function itself can be removed. The enriched external API representation can be passed over to the sync pod, and it then shouldn't need environment variables like RESOURCEGROUP to be set.

Upgrade Services fails

2018/07/20 20:29:30 Delete clusterIP, map[annotations:map[prometheus.io/port:1936 prometheus.io/scrape:true prometheus.openshift.io/password:75XmZa2KNV prometheus.openshift.io/username:admin] name:router-stats namespace:default selfLink:/api/v1/namespaces/default/services/router-stats uid:d8328c46-8c4c-11e8-b835-c2b37693715e resourceVersion:2792 creationTimestamp:2018-07-20T18:43:50Z labels:map[router:router-stats]]
2018/07/20 20:29:30 Update Service/default/router-stats
2018/07/20 20:29:30 - Object.map[metadata].map[annotations].map[prometheus.openshift.io/password]: 75XmZa2KNV != vgIdDy3bPG
2018/07/20 20:29:30 Error while syncing: Service "router-stats" is invalid: spec.clusterIP: Invalid value: "": field is immutable
+ KUBECONFIG=_data/_out/admin.kubeconfig

cc @jim-minter: it looks like clusterIP handling is needed not only for type LoadBalancer.

Consider using k8s deployments vs DCs for infra components

default                            docker-registry-1-deploy         0/1       Error              0          1h
default                            registry-console-1-deploy        0/1       Error              0          1h
default                            router-1-deploy                  0/1       Error              0          1h
openshift-ansible-service-broker   asb-1-deploy                     0/1       Error              0          1h

It's unfortunate watching our infra deployment configs fail because the deployer pod never managed to run or failed. These processes do not need the extra APIs provided by DCs and should retry indefinitely.
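
For illustration, a minimal sketch of what one of these components could look like as a plain Kubernetes Deployment instead of a DeploymentConfig (the name, labels and image below are placeholders, not our actual manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-registry
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-registry
  template:
    metadata:
      labels:
        app: docker-registry
    spec:
      containers:
      - name: registry
        image: registry-image:placeholder   # placeholder image reference
        ports:
        - containerPort: 5000

A Deployment is reconciled directly by the controller manager, so there is no one-shot deployer pod to fail and the rollout keeps being retried by the controller.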

etcd operator storage, crd and certificate issues

The plan was to use etcd-operator; here is where we struggle.

First, we need to know the namespace when generating certificates. This is because of
https://github.com/coreos/etcd-operator/blob/70d3bd74960dc7127870a393affffbe1df94728e/pkg/util/etcdutil/member.go#L38-L40
The result is that etcd advertises itself with name.namespace.svc and we need to have this in the certificates.

Second (and somewhat bigger) is storage.
The etcd-operator docs and examples online are misleading in places, so we rely on the code.

  1. The etcd pods themselves do not have any persistence.
    https://github.com/coreos/etcd-operator/blob/master/pkg/apis/etcd/v1beta2/cluster.go#L137
    Upstream issue:
    coreos/etcd-operator#1323

The idea is that we run in memory and back up constantly. In a DR situation, if a single pod is alive, the operator will recover. If all pods restart, recovery is done with the etcd-restore-operator, restoring from backup.

For this we need the etcd-backup and etcd-restore operators.
The backup operator supports two backup methods (Azure ABS and AWS S3): https://github.com/coreos/etcd-operator/blob/master/pkg/apis/etcd/v1beta2/backup_types.go#L19-L28

Configuration is what causes an issue: we need a secret containing the storage account name and key.
https://github.com/coreos/etcd-operator/blob/master/doc/design/abs_backup.md

This means the prerequisites are:

  1. Storage account created.
  2. Key available during creation of secret.

We don't want to create a storage account during ARM deployment, as it is not a client-facing configuration artifact. We could use one storage account with multiple buckets per customer and inject it from the backend.
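
As a sketch of what the backend would need to inject (the secret name and key names below are assumptions made for illustration; check doc/design/abs_backup.md for the exact format the backup operator expects):

# Hypothetical: create the ABS credentials secret in the customer's HCP namespace
kubectl -n "$HCP_NAMESPACE" create secret generic etcd-backup-azure-secret \
  --from-literal=storage-account="$STORAGE_ACCOUNT_NAME" \
  --from-literal=storage-key="$STORAGE_ACCOUNT_KEY"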

The last issue is Helm ordering for CRDs:
helm/helm#2994
TL;DR: when Helm creates a CRD, it takes some time for the cluster to accept it, so creating resources of that CRD type fails because it is not yet available.

In addition, we don't want to manage global CRDs for all users from the user-configuration side. If a CRD is deleted, all etcd clusters are deleted too. It looks like we need to manage them outside azure-helm, as part of HCP management.

cc: @jim-minter @Kargakis @pweil-
