
fury-kubernetes-logging's People

Contributors

angelbarrera92, beratio, emasacco, g-iannelli, gitirabassi, jnardiello, lnovara, lzecca78, nandajavarma, nohant, nutellinoit, phisco, ralgozino, reuenthal767, sbruzzese902, smerlos, stefanoghinelli

fury-kubernetes-logging's Issues

Elasticsearch Split Brain Issues (aka multiple masters)

This issue is related to all Elasticsearch versions prior to 7 (so it affects all our Fury logging modules to date).
From version 7 onwards this won't be a problem anymore, since the cluster coordination module has been reworked: https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch

Elasticsearch is a very flexible system that can be configured with (nearly) any number of nodes and master-eligible nodes. How does Elasticsearch know how many nodes are needed to reach a quorum for master election? Through a setting called minimum_master_nodes.

We can get the currently configured setting with the following API call:
GET _cluster/settings?include_defaults&filter_path=**.minimum_master_nodes

This is the result:
"discovery.zen.minimum_master_nodes" : "-1"

"-1" it's the same as 1, which means that in case of a network segmentation or in case of an unfortunate restart of the pod where the current master node lives the ES cluster can end up having 2 masters (split brain), a situation that makes the cluster unusable and that if it drags for too much can create dangling indices and other problems (this happened a lot of times to me). The only way that I know of to recover from a split brain is to scale down the ES sts to 0 replicas and then back again to 3.

To solve this problem we need to make two changes (see the sketch after this list):

  • set "discovery.zen.minimum_master_nodes" : "2" for the Elasticsearch triple
  • rework the readiness probe accordingly: until 2 nodes are up, the health API endpoint does not return success; but while the readiness probe is failing, the pod won't start accepting traffic and the next pod of the StatefulSet won't come up. Setting podManagementPolicy: "Parallel" on the StatefulSet won't help either, because pods that cannot accept traffic cannot form a cluster.
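A minimal sketch of both changes, assuming the official Elasticsearch 6.x image (which picks up settings from environment variables); container name, port, and probe timings are illustrative:

# Fragment of the Elasticsearch StatefulSet pod template.
containers:
  - name: elasticsearch
    env:
      # With 3 master-eligible nodes, a quorum of 2 prevents split brain.
      - name: discovery.zen.minimum_master_nodes
        value: "2"
    readinessProbe:
      # Probe the local node's HTTP port instead of the cluster health API,
      # so pods become ready (and the next sts pod starts) before a quorum exists.
      httpGet:
        path: /
        port: 9200
      initialDelaySeconds: 30
      periodSeconds: 10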

Improve infrastructural log management

All the Fury namespaces should have their own index on OpenSearch/Elasticsearch, and we should not parse their logs as JSON. The Kubernetes Flow and Output will exclude all the infra namespaces and retain the JSON parsing feature (see the sketch below).
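A minimal sketch of the exclusion on the logging-operator ClusterFlow; the resource name and the infra namespace list are assumptions:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: kubernetes
  namespace: logging
spec:
  match:
    # Infra namespaces (examples) are excluded here and get their own
    # flows/indexes without JSON parsing.
    - exclude:
        namespaces:
          - logging
          - monitoring
          - ingress-nginx
    # Everything else keeps flowing through this flow with JSON parsing.
    - select: {}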

[v3.*] ism-policy-cronjob job not updating policies

While checking an anomaly with @FedericoAntoniazzi, we found out that ism-policy-cronjob correctly creates new policies but fails to update existing ones, also throwing an error during its execution.

This is due to the lack of two parameters (seq_no and primary_term), which are required when updating policies.

So, to make the job fully idempotent, it should probably check whether a policy already exists with a GET request, which returns the aforementioned parameters, and then perform a PUT with the correct query string.
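A minimal sketch of that flow, with <opensearch_url> and <policy_id> as placeholders (the seq_no/primary_term values are illustrative):

# 1. Fetch the existing policy; the response carries _seq_no and _primary_term.
curl -s "http://<opensearch_url>/_plugins/_ism/policies/<policy_id>"
# -> ... "_seq_no": 7, "_primary_term": 1 ...

# 2. Update the policy, passing both values in the query string.
curl -X PUT -H 'Content-Type: application/json' -d @policy.json \
  "http://<opensearch_url>/_plugins/_ism/policies/<policy_id>?if_seq_no=7&if_primary_term=1"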

References

Update policy
Get policy

[v3.*] set default replicas for system indexes

Hi 👋

With @Deepzima and @ettoreciarcia we discovered that all system indexes in OpenSearch (in the form .opendistro*) have replicas set to 1 by default.

This causes all our single-node OpenSearch deployments to be in the yellow state.

We fixed the issue using the following strategy:

  • deleted the unassigned shards (in our case they were .opendistro-ism-config and .opendistro-job-scheduler-lock)
  • applied the following patch, as suggested here, and afterwards launched the ism-policy job manually (see the command after the patch)
curl -H 'Content-Type: application/json' -X PUT \
  -d '{
    "priority": 0,
    "index_patterns": [
      ".*"
    ],
    "template": {
      "settings": {
        "index": {
          "auto_expand_replicas": "0-1"
        }
      }
    }
  }' \
  http://<opensearch_url>/_index_template/auto_expand_replicas
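To launch the job manually, a one-off Job can be created from the CronJob (the namespace below is an assumption):

kubectl create job --from=cronjob/ism-policy-cronjob ism-policy-manual -n logging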

Probably this should be set together with our index patterns by the cronjob or, better, from the OpenSearch configuration file, if possible.

Problem with Kibana liveness probe

We have a pretty complex readinessProbe in Kibana: https://github.com/sighupio/fury-kubernetes-logging/blob/master/katalog/kibana/kibana.yml#L49

I'm facing some issues with it: Kibana is reported as not ready, but it works fine if I port-forward to it.

The problem is that we try to create the index patterns automatically; this makes Kibana query Elasticsearch, so the readiness probe depends on another service.

Can we remove or rethink this readinessProbe?
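A minimal sketch of a self-contained alternative, assuming Kibana's own /api/status endpoint is an acceptable readiness signal (timings are illustrative):

readinessProbe:
  # Only check that Kibana itself answers, without creating index patterns
  # or depending on Elasticsearch being reachable.
  httpGet:
    path: /api/status
    port: 5601
  initialDelaySeconds: 30
  periodSeconds: 10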

Logging operator's fluentd instances not printing logs to stdout

Problem

By default, the logging operator configures Fluentd to write its own logs to a file inside the container.

When doing troubleshooting activities, those logs are not available in Kibana/Loki, nor by querying them with kubectl logs or stern. The only way is to use kubectl exec to enter the pod and print the file to stdout, e.g.:

kubectl exec -it logging-demo-fluentd-0 -- cat /fluentd/log/out

Proposed solution

As described in the documentation, we should add the stdout filter to Flow and ClusterFlow resources:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: kubernetes
  namespace: logging
spec:
  filters:
    - stdout: {}

Additional information

  • Affected versions: >= 2.0.0

Some important cerebro metrics are not container aware.

Metrics exposed by cerebro such as:

  • process cpu % (nodes tab)
  • os cpu (nodes tab)
  • ram.percent (cat apis/nodes)

are not container aware and thus highly misleading (I've dug into the AngularJS and Play/Scala source code to confirm this).

Another problem is that there is no way to view any kind of historical data for the metrics exposed by cerebro.

My proposed solution is to enable Kibana X-Pack monitoring, "xpack.monitoring.collection.enabled" : "true", together with container-aware metrics, "xpack.monitoring.ui.container.elasticsearch.enabled" : "true" (the second one is a Kibana setting that should already default to true in the Kibana docker image). See the sketch below.
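A minimal sketch of where the two settings would live, assuming they are added to the config files shipped by the module:

# elasticsearch.yml (fragment): enable monitoring data collection.
xpack.monitoring.collection.enabled: true

# kibana.yml (fragment): render container-aware metrics in the Monitoring UI.
xpack.monitoring.ui.container.elasticsearch.enabled: true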

This change will enable the "Monitoring" tab in Kibana and will create 2 special indices with 1 shard and 1 replica each:

  • .monitoring-es-6-yyyy.mm.dd
  • .monitoring-kibana-6-yyyy.mm.dd

These indices come with an Index Lifecycle Management (ILM) policy attached and are automatically deleted after 7 days.
With the basic license you cannot set a retention that exceeds 7 days.
If a lower retention is needed, a curator job will have to be created that takes into account the ILM policy already attached to the indices (a sketch follows).
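A sketch of such a curator action, assuming a 3-day retention and curator >= 5.7, which skips ILM-managed indices unless told otherwise:

actions:
  1:
    action: delete_indices
    description: Trim xpack monitoring indices below the 7-day ILM retention
    options:
      ignore_empty_list: true
      # Required because these indices are ILM-managed.
      allow_ilm_indices: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: '.monitoring-'
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 3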

Improve nginx log parse

With the standard nginx parser there are a lot of parsing errors when parsing logs from the nginx ingress controller. We need to fix/update this behaviour.

[v3.*] time field discarded by CRI parser

Hello!

The Fluent Bit version currently set in logging-operated, 1.9.10, is still affected by the following bug.
Simply put, the CRI parser ignores and does not output the time field when present: this results in the creation of an index in OpenSearch whose name includes a 1970-01-01 suffix.

The problem was solved by this commit, which was released in Fluent Bit 2.0.7 (see the sketch below for pinning the image).
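A minimal sketch of pinning a fixed Fluent Bit image through the logging-operator Logging resource; the resource name, namespace, and the operator's compatibility with the 2.x tag are assumptions:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operated
spec:
  controlNamespace: logging
  fluentbit:
    image:
      repository: fluent/fluent-bit
      tag: "2.0.7"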

Make fluentd cloudwatch compatible

In some cases we need to send logs to CloudWatch from Fluentd.
This feature has been missing from our images since quay.io/sighup/fluentd:v1.11.5-debian-1.0.
Do you think it would be possible to have it back for the latest modules of logging v1 and v2? A possible approach is sketched below.
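A hypothetical Dockerfile sketch that re-adds the plugin on top of the SIGHUP image; the base tag shown is only an example and should be whichever image the current module ships:

# Hypothetical custom image: re-add the CloudWatch Logs output plugin.
FROM quay.io/sighup/fluentd:v1.11.5-debian-1.0
RUN gem install fluent-plugin-cloudwatch-logs --no-document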

Problems on rendering the correct datasource on Logging Dashboard

Dear team,
I've opened this issue with @nutellinoit just to track this bug that hit the Logging Dashboard.

Briefing:
I've installed this logging module at version 2.0.0 on an EKS cluster (v1.19.15) running Fury Distribution (FD) v1.5.1. Below is the list of the versions of all components:

versions:
  monitoring: v1.11.1
  logging: v2.0.0
  ingress: v1.9.1

Describe the bug:
When I try to use the new Logging Dashboard, all queries (widgets) throw an error message. Probably the default datasource in the dashboard schema is incorrect for this version of the monitoring module.
Below is a screenshot of the error:
Screenshot 2022-04-14 at 18 59 05

Add alerts for elasticsearch cluster health checks

Sometimes curator's CronJob hangs because the ES cluster is not healthy; it would be useful to know about it.
The issue arose from a momentary networking problem that prevented the three ES instances from forming a cluster, even after networking went back to normal. A sketch of a possible alert follows.
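A minimal sketch of a PrometheusRule, assuming an elasticsearch-exporter exposing elasticsearch_cluster_health_status is deployed (resource name, namespace, and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: elasticsearch-health
  namespace: logging
spec:
  groups:
    - name: elasticsearch.health
      rules:
        - alert: ElasticsearchClusterNotHealthy
          # Fires when the cluster has been RED for 5 minutes.
          expr: elasticsearch_cluster_health_status{color="red"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch cluster health is RED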
