qlik-oss / core-scaling

Qlik Core use case for auto-scaling in Kubernetes.
License: Other
When scaling down a cluster, it seems like the pre-hook is not always executed on all nodes before they are terminated. This problem was seen when doing some load testing on an 8-node cluster in GKE.
The problem occurred when the load was decreasing and a `TERMINATING` flag was put on one of the engine nodes. The pre-hook kicked in and the node stayed alive as long as there were sessions on that node. However, the HPA tried to scale down once more during this period, and two more engine nodes were killed instantly without receiving any `TERMINATING` flag. This leads to a lot of sessions being killed without a grace period.
This might be a problem with the HPA logic, but we need to investigate whether we can mitigate the problem or file a bug on the HPA.
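For reference, the graceful-shutdown mechanism discussed here is a `preStop` hook combined with a long `terminationGracePeriodSeconds` on the engine pods. A minimal sketch of that pattern (the image name and drain script are placeholders, not the repo's actual manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: engine
spec:
  selector:
    matchLabels:
      app: engine
  template:
    metadata:
      labels:
        app: engine
    spec:
      # Upper bound on how long a pod may linger between SIGTERM and
      # SIGKILL; the pre-hook below must finish within this window.
      terminationGracePeriodSeconds: 3600
      containers:
        - name: engine
          image: qlikcore/engine  # placeholder
          lifecycle:
            preStop:
              exec:
                # Hypothetical script that blocks until all sessions on
                # this engine have ended.
                command: ["/bin/sh", "-c", "/scripts/wait-for-sessions.sh"]
```

If the HPA can bypass this hook on a second scale-down, no grace period on the pod spec will help, which is why the HPA behaviour itself needs investigating.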
Should we list `jq` as a requirement? Should we remove it? Do something else?
Currently the `core-scaling` cluster is using the latest stable version of the `nginx-ingress-controller`. The latest stable version, `v0.26.1`, does not seem to work with our ingress config, while `v0.25.1` works fine.
According to the logs the failure is related to the `X-Real-IP` header, but this needs further investigation. For now we should hardcode the ingress-controller version to `v0.25.1`.
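Assuming the controller is installed from the `stable/nginx-ingress` Helm chart, a sketch of pinning the image version through the chart values (paths follow that chart's values layout):

```yaml
# values.yaml fragment for the stable/nginx-ingress chart:
# pin the controller image until the X-Real-IP issue is resolved.
controller:
  image:
    repository: quay.io/kubernetes-ingress-controller/nginx-ingress-controller
    tag: "0.25.1"
```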
```json
{
  "textPayload": "error: info: {extensions v1beta1 daemonsets} \"fluentd-gcp-v3.1.0\" was not changed\n",
  "insertId": "27xggrg2zjb3vk",
  "resource": {
    "type": "container",
    "labels": {
      "pod_id": "fluentd-gcp-scaler-697b966945-c5rlm",
      "zone": "europe-west3-a",
      "project_id": "dev-prod-qlik-core",
      "cluster_name": "core-scaling",
      "container_name": "fluentd-gcp-scaler",
      "namespace_id": "kube-system",
      "instance_id": "926775557221960039"
    }
  },
  "timestamp": "2018-12-27T11:31:20Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "fluentd-gcp-v3.1.0-qnl66",
    "container.googleapis.com/pod_name": "fluentd-gcp-scaler-697b966945-c5rlm",
    "container.googleapis.com/stream": "stderr",
    "container.googleapis.com/namespace_name": "kube-system"
  },
  "logName": "projects/dev-prod-qlik-core/logs/fluentd-gcp-scaler",
  "receiveTimestamp": "2018-12-27T11:31:23.384932729Z"
}
```
When putting some load on the cluster, I can see that the pods and machines are scaled up. When removing the load, the pods are scaled down but the machines are still up.
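One common cause worth checking (an assumption, not confirmed in this issue) is that the GKE cluster autoscaler refuses to drain a node that runs pods it considers unsafe to evict. Marking such pods explicitly can unblock node scale-down:

```yaml
# Pod template annotation telling the cluster autoscaler that this pod
# may be evicted, so the node it runs on can be removed on scale-down.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```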
Remove serial and control code in favour of the new license key format.
Instead of using a compute disk, we should change to a Cloud Filestore volume. Then we can use the stable `nfs-client` Helm chart.
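A sketch of how that could be wired up, assuming the `stable/nfs-client-provisioner` chart and a placeholder Filestore address:

```yaml
# values.yaml fragment for stable/nfs-client-provisioner.
# 10.0.0.2 and /vol1 stand in for the Filestore instance's
# IP address and share name.
nfs:
  server: 10.0.0.2
  path: /vol1
storageClass:
  name: nfs-client
  defaultClass: true
```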
When querying Kubernetes in Bash for Windows, there is an issue with the paths. This can be worked around by using the ordinary cmd in Windows. See if we can fix it for Bash for Windows.
Due to an issue in cert-manager, an upgrade needs to be done before 1 November. See below:

> Let's Encrypt has been working with Jetstack, the authors of cert-manager, on a series of fixes to the client. Cert-manager sometimes falls into a traffic pattern where it sends really excessive traffic to Let's Encrypt's servers, continuously. To mitigate this, we plan to start blocking all traffic from cert-manager versions less than 0.8.0 (the current semver minor release), as of November 1, 2019. Please upgrade all of your cert-manager instances before then.
If we really need to supply vanilla Kubernetes manifests, we might be able to use something like kubecrt instead.
Right now we are exposing both HTTP and HTTPS. Should we limit this to HTTPS only?
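If we do go HTTPS-only, one way to enforce it with the nginx ingress controller we already run is a redirect annotation on the ingress (host, service, and ingress names below are placeholders):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: core-scaling-ingress  # placeholder
  annotations:
    # Redirect all plain-HTTP requests to HTTPS.
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  rules:
    - host: example.com  # placeholder
      http:
        paths:
          - backend:
              serviceName: engine  # placeholder
              servicePort: 9076
```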
If we create a GKE cluster during a test run on CCI and the test fails, the remove-cluster step will not run.
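The usual CircleCI fix is to mark the teardown step to run regardless of earlier failures. A sketch (job, image, and script names are placeholders):

```yaml
# .circleci/config.yml fragment
jobs:
  e2e-test:
    docker:
      - image: google/cloud-sdk  # placeholder executor image
    steps:
      - run:
          name: Create GKE cluster and run tests
          command: ./scripts/create-cluster-and-test.sh  # placeholder
      - run:
          name: Remove GKE cluster
          command: ./scripts/remove-cluster.sh  # placeholder
          when: always  # run even when earlier steps have failed
```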
Today we tell our users to go to localhost:3000 after port-forwarding to Grafana, and from there they have to click around to find the dashboard. We could add a link that goes directly to the dashboard, as we have done in core-using-licenses.
We have updated our session-workout tool and need to change the documentation to match these changes.
We might also need to create a load test scenario in the session-workout tool.
Currently the certs are updated automatically, but Nginx probably does not pick them up when they change; instead it needs to be restarted.
This should be fixed so we do not have to do manual work to keep our backend secure for our users.
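One possible stopgap (an assumption, not a decided fix) is a CronJob that restarts the controller after the renewal window so nginx reloads the updated certificates; the schedule, names, and RBAC setup below are placeholders:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: reload-ingress-certs  # placeholder
spec:
  schedule: "0 4 * * 1"  # placeholder: every Monday at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          # Needs a service account with RBAC permission to patch deployments.
          serviceAccountName: ingress-restarter  # placeholder
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - rollout
                - restart
                - deployment/nginx-ingress-controller
```

A cleaner long-term fix would be to check whether the controller version we run can reload certificates dynamically, which would remove the need for restarts altogether.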
When we run CCI, the PVC disks are not removed.
In order to keep master a runnable example, we should create a new branch which holds our deployment-specific information. Today we are deploying cert-manager and oauth, which are not part of the example.
Looks like a huge curl response body is included in an error response for this service, causing log spam with ~100 kB log entries. This needs to be fixed before we enable auto-scaling again.
We should shift from using only GitHub to using Auth0, which in turn can use a lot of identity providers.
We have a lot of logs saved in Stackdriver. We need to investigate how and what to clean up.
After the cluster has been created and the doc seed is about to run, we get the following error:

```
Waiting for deployment to run
error: unable to upgrade connection: container not found ("nfs-server")
```
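The "container not found" error typically means the seed step tried to exec into the `nfs-server` pod before its container was running. One possible mitigation (an assumption, not a confirmed fix) is to give the container a readiness probe on the standard NFS port and have the seed script wait for readiness before exec'ing:

```yaml
# Fragment of the nfs-server pod spec. With this probe in place, a seed
# script can `kubectl wait --for=condition=ready pod/...` before exec'ing.
containers:
  - name: nfs-server
    image: gcr.io/google_containers/volume-nfs:0.8  # assumption about the image
    readinessProbe:
      tcpSocket:
        port: 2049  # standard NFS port
      initialDelaySeconds: 5
      periodSeconds: 5
```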
A BlackDuck Protex scan failed on this repository.
Number of files needing identification: 8
If you are unsure how this process works, please contact the trunk team on Slack. No more automated issues will be created on this repository until this issue is closed. Please prioritize and solve this as soon as possible. If the files have already been identified, you may ignore and close this issue.
Add a custom alert to our cluster that triggers if pods get killed.
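A sketch of what such an alert could look like as a Prometheus alerting rule, assuming kube-state-metrics is being scraped (group name, namespace, and thresholds are placeholders):

```yaml
groups:
  - name: pod-health  # placeholder
    rules:
      - alert: PodKilled
        # Fires when any container in the default namespace has
        # restarted within the last five minutes.
        expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted"
```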
Builds are failing both on master and in the current PRs; we need to investigate and fix this.
We currently get a lot of log entries like:

```
HTTP: Read error : Last write time for /usr/local/Client/metrics failed: LastWriteTimeException /usr/local/Client/metrics failed: boost::filesystem::last_write_time: No such file or directory: "/usr/local/Client/metrics": No such file or directory
```

They are caused by something trying to scrape engine `/metrics` on port 9076, instead of the 9090 metrics port.
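If the scrape target comes from Prometheus pod annotations, a sketch of steering it to the right port (this assumes the common `prometheus.io` annotation convention is what drives scraping here):

```yaml
# Pod template annotations pointing Prometheus at the metrics port
# instead of the engine API port.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"  # metrics port, not the 9076 engine port
    prometheus.io/path: "/metrics"
```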