core-scaling's People

Contributors

carlioth, dependabot[bot], fredrikfolkesson, gabbaxx, peol, qlikossbuild, renovate[bot], sublibra, wennmo

core-scaling's Issues

HPA killing pods without pre-hook being executed

When scaling down a cluster, the pre-hook does not always seem to run on every node before the node is terminated. We saw this while load testing an 8-node cluster in GKE.

The problem occurred while the load was decreasing and one of the engine nodes was flagged as TERMINATING. The pre-hook kicked in and kept the node alive as long as it had sessions. During this period, however, the HPA tried to scale down once more, and two more engine nodes were killed instantly without ever being flagged as TERMINATING. This would lead to many sessions being killed without a grace period.

This might be a problem with the HPA logic, but we need to investigate whether we can mitigate it ourselves or should file a bug against the HPA.
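
One mitigation worth evaluating is to limit how fast the HPA is allowed to scale down, so that a second scale-down cannot start while the first node is still draining sessions. The manifest below is a minimal sketch, not a confirmed fix for this issue: the resource names, replica counts, CPU target and time windows are all assumptions, and the `behavior` field requires Kubernetes 1.18 or later (autoscaling/v2beta2 or autoscaling/v2).

```yaml
# Sketch only: names and numbers are hypothetical, not taken from this repository.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: engine                      # hypothetical HPA name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: engine                    # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # assumed target
  behavior:
    scaleDown:
      # Use the highest recommendation from the last 10 minutes before scaling down.
      stabilizationWindowSeconds: 600
      policies:
        # Remove at most one pod per 10-minute period.
        - type: Pods
          value: 1
          periodSeconds: 600
```

The window and period would need to be tuned against how long sessions typically keep a terminating node alive. This only slows down scale-down; it does not explain why the extra nodes were killed without being flagged.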

nginx-ingress-controller v0.26.1 is not working with current ingress config

Currently the core-scaling cluster uses the latest stable version of the nginx-ingress-controller, which is v0.26.1. That version does not seem to work with our ingress config, whereas v0.25.1 works fine.

According to the logs, the problem is related to the X-Real-IP header, but this needs further investigation. For now we should pin the ingress-controller version to v0.25.1.
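
Pinning the version could look roughly like the fragment below, which sets an explicit image tag in the controller's pod template instead of following the latest release. This is a sketch that assumes the controller is deployed from a plain Deployment manifest; the container name is hypothetical, and the image path and tag format are what the upstream project published for this release line, so they should be checked against the manifest actually in use.

```yaml
# Fragment of the ingress controller Deployment (pod template only).
# Container name is an assumption; verify the image path against the deployed manifest.
spec:
  template:
    spec:
      containers:
        - name: nginx-ingress-controller
          # Explicit tag instead of a floating "latest stable" reference.
          image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.25.1
```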

Error logs on fluentd-gcp-scaler every minute

{
    "textPayload": "error: info: {extensions v1beta1 daemonsets} \"fluentd-gcp-v3.1.0\" was not changed\n",
    "insertId": "27xggrg2zjb3vk",
    "resource": {
        "type": "container",
        "labels": {
            "pod_id": "fluentd-gcp-scaler-697b966945-c5rlm",
            "zone": "europe-west3-a",
            "project_id": "dev-prod-qlik-core",
            "cluster_name": "core-scaling",
            "container_name": "fluentd-gcp-scaler",
            "namespace_id": "kube-system",
            "instance_id": "926775557221960039"
        }
    },
    "timestamp": "2018-12-27T11:31:20Z",
    "severity": "ERROR",
    "labels": {
        "compute.googleapis.com/resource_name": "fluentd-gcp-v3.1.0-qnl66",
        "container.googleapis.com/pod_name": "fluentd-gcp-scaler-697b966945-c5rlm",
        "container.googleapis.com/stream": "stderr",
        "container.googleapis.com/namespace_name": "kube-system"
    },
    "logName": "projects/dev-prod-qlik-core/logs/fluentd-gcp-scaler",
    "receiveTimestamp": "2018-12-27T11:31:23.384932729Z"
}

Change backing volume in GKE

Instead of a compute disk, we should use a Cloud Filestore volume as the backing volume.
Then we can use the stable nfs-client Helm chart.

Bump cert-manager to 0.8.0 or greater

Due to an issue in cert-manager, an upgrade needs to be done before 1 November. See the notice from Let's Encrypt below:

Let's Encrypt has been working with Jetstack, the authors of cert-manager, on a
series of fixes to the client. Cert-manager sometimes falls into a
traffic pattern where it sends really excessive traffic to Let's
Encrypt's servers, continuously. To mitigate this, we plan to start
blocking all traffic from cert-manager versions less than 0.8.0 (the
current semver minor release), as of November 1, 2019. Please upgrade
all of your cert-manager instances before then.

Opening up the correct dashboard directly in Grafana

Today we tell our users to go to localhost:3000 after port-forwarding to Grafana, and from there they have to click around to find the dashboard. We could add a link that goes directly to the dashboard, as we have done in core-using-licenses.

Autoupdate certs on live version

Currently the certs are updated automatically, but Nginx (probably) does not pick up the updated certs and needs to be restarted instead.

This should be fixed so that we do not have to do manual work to keep our backend secure for our users.

Create a deploy pipeline branch

To keep master a running example, we should create a new branch that holds our deployment-specific information. Today we deploy cert-manager and OAuth, neither of which is part of the example.

Issue running ./run.sh deploy

After the cluster has been created, at the point where the doc seed should run, we get the following error:

Waiting for deployment to run
error: unable to upgrade connection: container not found ("nfs-server")

Automated: Protex scan failed

A BlackDuck Protex scan failed on this repository.

Number of files needing identification: 8

If you are unsure how this process works, please contact the trunk team on Slack.

No more automated issues will be created on this repository until this issue
is closed.

Please prioritize and solve this as soon as possible. If the files
have been identified already, you may ignore and close this issue.

Engine metrics scraped on wrong port

We currently get a lot of log entries like the following:

HTTP: Read error : Last write time for /usr/local/Client/metrics failed: LastWriteTimeException /usr/local/Client/metrics failed: boost::filesystem::last_write_time: No such file or directory: "/usr/local/Client/metrics": No such file or directory

They are caused by something trying to scrape the engine's /metrics endpoint on port 9076 instead of on the metrics port, 9090.
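
If the scraper is a Prometheus instance that discovers pods through the commonly used `prometheus.io/*` annotations, the metrics port can be stated explicitly on the engine pod template, as in the fragment below. This is a sketch under that assumption: the annotations are a convention that only takes effect when the Prometheus scrape config contains the matching relabel rules, and the container and port names are hypothetical.

```yaml
# Fragment of the engine pod template; annotation handling depends on the
# Prometheus scrape config in use (assumed, not confirmed by this repository).
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"      # metrics port, not the 9076 engine port
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: engine                  # hypothetical container name
      ports:
        - name: qix                 # hypothetical port name
          containerPort: 9076
        - name: metrics             # hypothetical port name
          containerPort: 9090
```

Without an explicit port annotation, some scrape configs try every declared container port, which would produce requests to /metrics on 9076 like the ones in the log.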

Update yamls to follow k8s recommendations

Running the zegl/kube-score tool prints a list of recommendations and best practices.
Evaluate them and make changes accordingly.
Output when checking the Qlik-core yamls:

[Screenshot of the kube-score output, 2018-12-02]
