
fury-kubernetes-logging's People

Contributors

angelbarrera92, beratio, emasacco, g-iannelli, gitirabassi, jnardiello, lnovara, lzecca78, nandajavarma, nohant, nutellinoit, phisco, ralgozino, reuenthal767, sbruzzese902, smerlos, stefanoghinelli

fury-kubernetes-logging's Issues

Elasticsearch Split Brain Issues (aka multiple masters)

This issue is related to all Elasticsearch versions prior to 7 (so it affects all our Fury logging modules to date).
From version 7 onwards this won't be a problem anymore, since the cluster coordination module has been reworked: https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch

Elasticsearch is a very flexible system that can be configured with (nearly) any number of nodes and master-eligible nodes. How does Elasticsearch know how many nodes are needed to reach a quorum for master election? Through a setting called minimum_master_nodes.

We can get the currently configured setting with the following API call:
GET _cluster/settings?include_defaults&filter_path=**.minimum_master_nodes

This is the result:
"discovery.zen.minimum_master_nodes" : "-1"

"-1" it's the same as 1, which means that in case of a network segmentation or in case of an unfortunate restart of the pod where the current master node lives the ES cluster can end up having 2 masters (split brain), a situation that makes the cluster unusable and that if it drags for too much can create dangling indices and other problems (this happened a lot of times to me). The only way that I know of to recover from a split brain is to scale down the ES sts to 0 replicas and then back again to 3.

To solve this problem we need to make two changes (see the sketch after this list):

  • set "discovery.zen.minimum_master_nodes" : "2" for the Elasticsearch triple
  • rework the readiness probe accordingly: until 2 nodes are up, the health API endpoint does not return success; but while the readiness probe is failing, the pod won't start accepting traffic and the next pod of the StatefulSet won't come up. Setting podManagementPolicy: "Parallel" on the StatefulSet won't help either, because pods that cannot accept traffic cannot form a cluster.
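A minimal sketch of both changes, assuming the official Elasticsearch 6.x image (which picks up settings from environment variables); container name, port, and probe timings are illustrative:

# Fragment of the Elasticsearch StatefulSet pod template.
containers:
  - name: elasticsearch
    env:
      # With 3 master-eligible nodes, a quorum of 2 prevents split brain.
      - name: discovery.zen.minimum_master_nodes
        value: "2"
    readinessProbe:
      # Probe the local node's HTTP port instead of the cluster health API,
      # so pods become ready (and the next sts pod starts) before a quorum exists.
      httpGet:
        path: /
        port: 9200
      initialDelaySeconds: 30
      periodSeconds: 10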

Improve infrastructural log management

All the Fury namespaces should have their own index on OpenSearch/Elasticsearch, and we should not parse their logs as JSON. The Kubernetes Flow and Output will exclude all the infra namespaces and retain the JSON parsing feature (see the sketch below).
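A minimal sketch of the exclusion on the logging-operator ClusterFlow; the resource name and the infra namespace list are assumptions:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: kubernetes
  namespace: logging
spec:
  match:
    # Infra namespaces (examples) are excluded here and get their own
    # flows/indexes without JSON parsing.
    - exclude:
        namespaces:
          - logging
          - monitoring
          - ingress-nginx
    # Everything else keeps flowing through this flow with JSON parsing.
    - select: {}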

[v3.*] ism-policy-cronjob job not updating policies

While checking an anomaly with @FedericoAntoniazzi, we found out that ism-policy-cronjob correctly creates new policies but fails to update existing ones, also throwing an error during its execution.

This is due to the lack of two parameters (seq_no and primary_term), which are required when updating policies.

So, to make the job fully idempotent, it should probably check whether a policy already exists with a GET request, which returns the aforementioned parameters, and then perform a PUT with the correct query string.
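A minimal sketch of that flow, with <opensearch_url> and <policy_id> as placeholders (the seq_no/primary_term values are illustrative):

# 1. Fetch the existing policy; the response carries _seq_no and _primary_term.
curl -s "http://<opensearch_url>/_plugins/_ism/policies/<policy_id>"
# -> ... "_seq_no": 7, "_primary_term": 1 ...

# 2. Update the policy, passing both values in the query string.
curl -X PUT -H 'Content-Type: application/json' -d @policy.json \
  "http://<opensearch_url>/_plugins/_ism/policies/<policy_id>?if_seq_no=7&if_primary_term=1"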

References

Update policy
Get policy

[v3.*] set default replicas for system indexes

Hi 👋

With @Deepzima and @ettoreciarcia we discovered that all system indexes in OpenSearch (in the form .opendistro*) have replicas set to 1 by default.

This causes all our single-node OpenSearch deployments to be in the yellow state.

We fixed the issue using the following strategy:

  • deleted the unassigned shards (in our case they were .opendistro-ism-config and .opendistro-job-scheduler-lock)
  • applied the following patch, as suggested here, and afterwards launched the ism-policy job manually (see the command after the patch)
curl -H 'Content-Type: application/json' -X PUT \
  -d '{
    "priority": 0,
    "index_patterns": [
      ".*"
    ],
    "template": {
      "settings": {
        "index": {
          "auto_expand_replicas": "0-1"
        }
      }
    }
  }' \
  http://<opensearch_url>/_index_template/auto_expand_replicas
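To launch the job manually, a one-off Job can be created from the CronJob (the namespace below is an assumption):

kubectl create job --from=cronjob/ism-policy-cronjob ism-policy-manual -n logging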

Probably this should be set together with our index patterns by the cronjob or, better, from the OpenSearch configuration file, if possible.

Problem with Kibana liveness probe

We have a pretty complex readinessProbe in Kibana: https://github.com/sighupio/fury-kubernetes-logging/blob/master/katalog/kibana/kibana.yml#L49

I'm facing some issues with it: Kibana is reported as not ready, but it works fine if I port-forward to it.

The problem is that we try to create the index patterns automatically; this makes Kibana query Elasticsearch, so the readiness probe depends on another service.

Can we remove or rethink this readinessProbe?
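A minimal sketch of a self-contained alternative, assuming Kibana's own /api/status endpoint is an acceptable readiness signal (timings are illustrative):

readinessProbe:
  # Only check that Kibana itself answers, without creating index patterns
  # or depending on Elasticsearch being reachable.
  httpGet:
    path: /api/status
    port: 5601
  initialDelaySeconds: 30
  periodSeconds: 10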

Logging operator's fluentd instances not printing logs to stdout

Problem

By default, the logging operator configures Fluentd to write its own logs to a file inside the container.

When doing troubleshooting activities, those logs are not available in Kibana/Loki, nor by querying them with kubectl logs or stern. The only way is to use kubectl exec to enter the pod and print the file to stdout, e.g.:

kubectl exec -it logging-demo-fluentd-0 -- cat /fluentd/log/out

Proposed solution

As described in the documentation, we should add the stdout filter to Flow and ClusterFlow resources:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: kubernetes
  namespace: logging
spec:
  filters:
    - stdout: {}

Additional information

  • Affected versions: >= 2.0.0

Some important cerebro metrics are not container aware.

Metrics exposed by cerebro such as:

  • process cpu % (nodes tab)
  • os cpu (nodes tab)
  • ram.percent (cat apis/nodes)

are not container aware and thus highly misleading (I've dug into the AngularJS and Play/Scala source code to confirm this).

Another problem is that there is no way to view any kind of historical data for the metrics exposed by cerebro.

My proposed solution is to enable Kibana X-Pack monitoring, "xpack.monitoring.collection.enabled" : "true", together with container-aware metrics, "xpack.monitoring.ui.container.elasticsearch.enabled" : "true" (the second one is a Kibana setting that should already default to true in the Kibana docker image). See the sketch below.
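A minimal sketch of where the two settings would live, assuming they are added to the config files shipped by the module:

# elasticsearch.yml (fragment): enable monitoring data collection.
xpack.monitoring.collection.enabled: true

# kibana.yml (fragment): render container-aware metrics in the Monitoring UI.
xpack.monitoring.ui.container.elasticsearch.enabled: true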

This change will enable the "Monitoring" tab in Kibana and will create 2 special indices with 1 shard and 1 replica each:

  • .monitoring-es-6-yyyy.mm.dd
  • .monitoring-kibana-6-yyyy.mm.dd

These indices come with an Index Lifecycle Management (ILM) policy attached and are automatically deleted after 7 days.
With the basic license you cannot set a retention that exceeds 7 days.
If a lower retention is needed, a curator job will have to be created that takes into account the ILM policy already attached to the indices (a sketch follows).
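A sketch of such a curator action, assuming a 3-day retention and curator >= 5.7, which skips ILM-managed indices unless told otherwise:

actions:
  1:
    action: delete_indices
    description: Trim xpack monitoring indices below the 7-day ILM retention
    options:
      ignore_empty_list: true
      # Required because these indices are ILM-managed.
      allow_ilm_indices: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: '.monitoring-'
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 3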

Improve nginx log parse

With the standard nginx parser there are a lot of parsing errors when parsing logs from the nginx ingress controller. We need to fix/update this behaviour.

[v3.*] time field discarded by CRI parser

Hello!

The Fluent Bit version currently set in logging-operated, 1.9.10, is still affected by the following bug.
Simply put, the CRI parser ignores and does not output the time field when present: this results in the creation of an index in OpenSearch whose name includes a 1970-01-01 suffix.

The problem was solved by this commit, which was released in Fluent Bit 2.0.7 (see the sketch below for pinning the image).
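A minimal sketch of pinning a fixed Fluent Bit image through the logging-operator Logging resource; the resource name, namespace, and the operator's compatibility with the 2.x tag are assumptions:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operated
spec:
  controlNamespace: logging
  fluentbit:
    image:
      repository: fluent/fluent-bit
      tag: "2.0.7"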

Make fluentd cloudwatch compatible

In some cases we need to send logs to CloudWatch from Fluentd.
This feature has been missing from our images since quay.io/sighup/fluentd:v1.11.5-debian-1.0.
Do you think it would be possible to have it back for the latest modules of logging v1 and v2? A possible approach is sketched below.
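A hypothetical Dockerfile sketch that re-adds the plugin on top of the SIGHUP image; the base tag shown is only an example and should be whichever image the current module ships:

# Hypothetical custom image: re-add the CloudWatch Logs output plugin.
FROM quay.io/sighup/fluentd:v1.11.5-debian-1.0
RUN gem install fluent-plugin-cloudwatch-logs --no-document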

Problems on rendering the correct datasource on Logging Dashboard

Dear team,
I've opened this issue with @nutellinoit just to track this bug that hit the Logging Dashboard.

Briefing:
I've installed this logging module at version 2.0.0 on an EKS cluster (v1.19.15) running Fury Distribution (FD) v1.5.1. Below is the list of the versions of all components:

versions:
  monitoring: v1.11.1
  logging: v2.0.0
  ingress: v1.9.1

Describe the bug:
When I try to use the new Logging Dashboard, all queries (widgets) throw an error message. Probably the default datasource in the dashboard schema is incorrect for this version of the monitoring module.
Below is a screenshot of the error:
Screenshot 2022-04-14 at 18 59 05

Add alerts for elasticsearch cluster health checks

Sometimes curator's CronJob hangs because the ES cluster is not healthy; it would be useful to know about it.
The issue arose from a momentary networking problem that prevented the three ES instances from forming a cluster, even after networking went back to normal. A sketch of a possible alert follows.
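A minimal sketch of a PrometheusRule, assuming an elasticsearch-exporter exposing elasticsearch_cluster_health_status is deployed (resource name, namespace, and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: elasticsearch-health
  namespace: logging
spec:
  groups:
    - name: elasticsearch.health
      rules:
        - alert: ElasticsearchClusterNotHealthy
          # Fires when the cluster has been RED for 5 minutes.
          expr: elasticsearch_cluster_health_status{color="red"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch cluster health is RED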
