qlik-oss / core-scaling

Qlik Core use case for auto-scaling in Kubernetes.
License: Other
When scaling down a cluster, it seems like the pre-hook is not always executed on all nodes before they are terminated. This problem was seen when doing some load testing on an 8-node cluster in GKE.
The problem occurred when the load was decreasing and a `TERMINATING` flag was put on one of the engine nodes. The pre-hook kicked in and the node stayed alive as long as there were sessions on that node. However, the HPA tried to scale down once more during this period, and two more engine nodes were killed instantly without receiving any `TERMINATING` flag. This leads to a lot of sessions being killed without a grace period.
This might be a problem with the HPA logic, but we need to investigate whether we can mitigate the problem or file a bug on the HPA.
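For reference, the graceful-shutdown mechanism discussed here is a `preStop` hook combined with a long `terminationGracePeriodSeconds` on the engine pods. A minimal sketch of that pattern (the image name and drain script are placeholders, not the repo's actual manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: engine
spec:
  selector:
    matchLabels:
      app: engine
  template:
    metadata:
      labels:
        app: engine
    spec:
      # Upper bound on how long a pod may linger between SIGTERM and
      # SIGKILL; the pre-hook below must finish within this window.
      terminationGracePeriodSeconds: 3600
      containers:
        - name: engine
          image: qlikcore/engine  # placeholder
          lifecycle:
            preStop:
              exec:
                # Hypothetical script that blocks until all sessions on
                # this engine have ended.
                command: ["/bin/sh", "-c", "/scripts/wait-for-sessions.sh"]
```

If the HPA can bypass this hook on a second scale-down, no grace period on the pod spec will help, which is why the HPA behaviour itself needs investigating.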
Should we list `jq` as a requirement? Should we remove it? Do something else?
Currently the `core-scaling` cluster is using the latest stable version of the `nginx-ingress-controller`. The latest stable version, `v0.26.1`, does not seem to work with our ingress config, while `v0.25.1` works fine.
According to the logs the failure is related to the `X-Real-IP` header, but this needs further investigation. For now we should hardcode the ingress-controller version to `v0.25.1`.
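Assuming the controller is installed from the `stable/nginx-ingress` Helm chart, a sketch of pinning the image version through the chart values (paths follow that chart's values layout):

```yaml
# values.yaml fragment for the stable/nginx-ingress chart:
# pin the controller image until the X-Real-IP issue is resolved.
controller:
  image:
    repository: quay.io/kubernetes-ingress-controller/nginx-ingress-controller
    tag: "0.25.1"
```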
```json
{
  "textPayload": "error: info: {extensions v1beta1 daemonsets} \"fluentd-gcp-v3.1.0\" was not changed\n",
  "insertId": "27xggrg2zjb3vk",
  "resource": {
    "type": "container",
    "labels": {
      "pod_id": "fluentd-gcp-scaler-697b966945-c5rlm",
      "zone": "europe-west3-a",
      "project_id": "dev-prod-qlik-core",
      "cluster_name": "core-scaling",
      "container_name": "fluentd-gcp-scaler",
      "namespace_id": "kube-system",
      "instance_id": "926775557221960039"
    }
  },
  "timestamp": "2018-12-27T11:31:20Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "fluentd-gcp-v3.1.0-qnl66",
    "container.googleapis.com/pod_name": "fluentd-gcp-scaler-697b966945-c5rlm",
    "container.googleapis.com/stream": "stderr",
    "container.googleapis.com/namespace_name": "kube-system"
  },
  "logName": "projects/dev-prod-qlik-core/logs/fluentd-gcp-scaler",
  "receiveTimestamp": "2018-12-27T11:31:23.384932729Z"
}
```
When putting some load on the cluster, I can see that the pods and machines are scaled up. When removing the load, the pods are scaled down but the machines are still up.
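One common cause worth checking (an assumption, not confirmed in this issue) is that the GKE cluster autoscaler refuses to drain a node that runs pods it considers unsafe to evict. Marking such pods explicitly can unblock node scale-down:

```yaml
# Pod template annotation telling the cluster autoscaler that this pod
# may be evicted, so the node it runs on can be removed on scale-down.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```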
Remove serial and control code in favour of the new license key format.
Instead of using a compute disk, we should change to a Cloud Filestore volume. Then we can use the stable `nfs-client` Helm chart.
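A sketch of how that could be wired up, assuming the `stable/nfs-client-provisioner` chart and a placeholder Filestore address:

```yaml
# values.yaml fragment for stable/nfs-client-provisioner.
# 10.0.0.2 and /vol1 stand in for the Filestore instance's
# IP address and share name.
nfs:
  server: 10.0.0.2
  path: /vol1
storageClass:
  name: nfs-client
  defaultClass: true
```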
When querying Kubernetes in Bash for Windows, there is an issue with the paths. This can be worked around by using the ordinary cmd in Windows. See if we can fix it for Bash for Windows.
Due to an issue in cert-manager, an upgrade needs to be done before 1 November. See below:

> Let's Encrypt has been working with Jetstack, the authors of cert-manager, on a series of fixes to the client. Cert-manager sometimes falls into a traffic pattern where it sends really excessive traffic to Let's Encrypt's servers, continuously. To mitigate this, we plan to start blocking all traffic from cert-manager versions less than 0.8.0 (the current semver minor release), as of November 1, 2019. Please upgrade all of your cert-manager instances before then.
If we really need to supply vanilla Kubernetes manifests, we might be able to use something like kubecrt instead.
Right now we are exposing both HTTP and HTTPS. Should we limit this to HTTPS only?
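If we do go HTTPS-only, one way to enforce it with the nginx ingress controller we already run is a redirect annotation on the ingress (host, service, and ingress names below are placeholders):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: core-scaling-ingress  # placeholder
  annotations:
    # Redirect all plain-HTTP requests to HTTPS.
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  rules:
    - host: example.com  # placeholder
      http:
        paths:
          - backend:
              serviceName: engine  # placeholder
              servicePort: 9076
```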
If we create a GKE cluster during a test run on CCI and the test fails, the remove-cluster step will not run.
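The usual CircleCI fix is to mark the teardown step to run regardless of earlier failures. A sketch (job, image, and script names are placeholders):

```yaml
# .circleci/config.yml fragment
jobs:
  e2e-test:
    docker:
      - image: google/cloud-sdk  # placeholder executor image
    steps:
      - run:
          name: Create GKE cluster and run tests
          command: ./scripts/create-cluster-and-test.sh  # placeholder
      - run:
          name: Remove GKE cluster
          command: ./scripts/remove-cluster.sh  # placeholder
          when: always  # run even when earlier steps have failed
```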
Today we tell our users to go to localhost:3000 after port-forwarding to Grafana, and from there they have to click around to find the dashboard. We could add a link that goes directly to the dashboard, as we have done in core-using-licenses.
We have updated our session-workout tool and need to change the documentation to match these changes.
We might also need to create a load test scenario in the session-workout tool.
Currently the certs are updated automatically, but Nginx probably does not pick them up when they change; instead it needs to be restarted.
This should be fixed so we do not have to do manual work to keep our backend secure for our users.
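One possible stopgap (an assumption, not a decided fix) is a CronJob that restarts the controller after the renewal window so nginx reloads the updated certificates; the schedule, names, and RBAC setup below are placeholders:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: reload-ingress-certs  # placeholder
spec:
  schedule: "0 4 * * 1"  # placeholder: every Monday at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          # Needs a service account with RBAC permission to patch deployments.
          serviceAccountName: ingress-restarter  # placeholder
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - rollout
                - restart
                - deployment/nginx-ingress-controller
```

A cleaner long-term fix would be to check whether the controller version we run can reload certificates dynamically, which would remove the need for restarts altogether.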
When we run CCI, the PVC disks are not removed.
In order to keep master a runnable example, we should create a new branch which holds our deployment-specific information. Today we are deploying cert-manager and oauth, which are not part of the example.
Looks like a huge curl response body is included in an error response for this service, causing log spam with ~100 kB log entries. This needs to be fixed before we enable auto-scaling again.
We should shift from using only GitHub to using Auth0, which in turn can use a lot of identity providers.
We have a lot of logs saved in Stackdriver. We need to investigate how and what to clean up.
After the cluster has been created and the doc seed is about to run, we get the following error:

```
Waiting for deployment to run
error: unable to upgrade connection: container not found ("nfs-server")
```
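The "container not found" error typically means the seed step tried to exec into the `nfs-server` pod before its container was running. One possible mitigation (an assumption, not a confirmed fix) is to give the container a readiness probe on the standard NFS port and have the seed script wait for readiness before exec'ing:

```yaml
# Fragment of the nfs-server pod spec. With this probe in place, a seed
# script can `kubectl wait --for=condition=ready pod/...` before exec'ing.
containers:
  - name: nfs-server
    image: gcr.io/google_containers/volume-nfs:0.8  # assumption about the image
    readinessProbe:
      tcpSocket:
        port: 2049  # standard NFS port
      initialDelaySeconds: 5
      periodSeconds: 5
```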
A BlackDuck Protex scan failed on this repository.
Number of files needing identification: 8
If you are unsure how this process works, please contact the trunk team on Slack. No more automated issues will be created on this repository until this issue is closed. Please prioritize and solve this as soon as possible. If the files have already been identified, you may ignore and close this issue.
Add a custom alert to our cluster that triggers if pods get killed.
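A sketch of what such an alert could look like as a Prometheus alerting rule, assuming kube-state-metrics is being scraped (group name, namespace, and thresholds are placeholders):

```yaml
groups:
  - name: pod-health  # placeholder
    rules:
      - alert: PodKilled
        # Fires when any container in the default namespace has
        # restarted within the last five minutes.
        expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted"
```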
Builds are failing both on master and in the current PRs; we need to investigate and fix this.
We currently get a lot of log entries like:

```
HTTP: Read error : Last write time for /usr/local/Client/metrics failed: LastWriteTimeException /usr/local/Client/metrics failed: boost::filesystem::last_write_time: No such file or directory: "/usr/local/Client/metrics": No such file or directory
```

They are caused by something trying to scrape engine `/metrics` on port 9076, instead of the 9090 metrics port.
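If the scrape target comes from Prometheus pod annotations, a sketch of steering it to the right port (this assumes the common `prometheus.io` annotation convention is what drives scraping here):

```yaml
# Pod template annotations pointing Prometheus at the metrics port
# instead of the engine API port.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"  # metrics port, not the 9076 engine port
    prometheus.io/path: "/metrics"
```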