sighupio / fury-kubernetes-logging
Kubernetes Fury Distribution Logging Core Module: centralized logging for your Kubernetes Cluster
License: BSD 3-Clause "New" or "Revised" License
The currently provided Loki package configuration is not production-grade; the package needs to be adapted to be production-grade.
This issue is related to all Elasticsearch versions prior to 7 (so it affects all our Fury logging modules to date).
From version 7 onwards this won't be a problem anymore, since the cluster coordination module has been reworked: https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch
Elasticsearch is a very flexible system that can be configured with (nearly) any number of nodes and master nodes. How does Elasticsearch know how many nodes are needed to reach a quorum for master election? By taking into account a setting called minimum_master_nodes.
We can get the currently configured setting with the following API call:
GET _cluster/settings?include_defaults&filter_path=**.minimum_master_nodes
This is the result:
"discovery.zen.minimum_master_nodes" : "-1"
"-1" it's the same as 1, which means that in case of a network segmentation or in case of an unfortunate restart of the pod where the current master node lives the ES cluster can end up having 2 masters (split brain), a situation that makes the cluster unusable and that if it drags for too much can create dangling indices and other problems (this happened a lot of times to me). The only way that I know of to recover from a split brain is to scale down the ES sts
to 0 replicas and then back again to 3.
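As a sketch, the recovery procedure above maps to two kubectl commands (the namespace and StatefulSet name here are assumptions, not the module's actual names):

```shell
# Assumed names: namespace "logging", StatefulSet "elasticsearch"
kubectl -n logging scale statefulset elasticsearch --replicas=0
# Wait until every pod has terminated, then scale back up
kubectl -n logging scale statefulset elasticsearch --replicas=3
```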
To solve this problem we need to make 2 changes:
- Set "discovery.zen.minimum_master_nodes" : "2" for the elasticsearch-triple package. Note that with this setting the sts won't come up on its own; setting the sts podManagementPolicy: "Parallel" won't help either, because if the pods cannot accept traffic they won't be able to form a cluster.
- The 65536 open files limit is too low: we need to check and change the value to a higher ceiling, or not define that value at all.
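As a sketch of the first change, assuming the setting is passed to the ES container as an environment variable (the official Elasticsearch Docker image accepts settings in this form; the exact manifest layout of the elasticsearch-triple package may differ):

```yaml
# Hypothetical excerpt from the elasticsearch-triple StatefulSet:
# with 3 master-eligible nodes, quorum is (3 / 2) + 1 = 2
env:
  - name: discovery.zen.minimum_master_nodes
    value: "2"
```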
All the Fury namespaces should have their own index on OpenSearch/Elasticsearch, and we should not parse logs as JSON. The Kubernetes Flow and Output will exclude all the infra namespaces and retain the JSON parsing feature.
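A sketch of what the Kubernetes ClusterFlow could look like, assuming the logging-operator v1beta1 API; the excluded namespace list and the output name are placeholders, not the module's actual values:

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: kubernetes
  namespace: logging
spec:
  match:
    # Exclude the infra namespaces, which keep their own dedicated flows
    - exclude:
        namespaces:
          - kube-system
          - logging
          - monitoring
  filters:
    # Retain the JSON parsing feature for this flow only
    - parser:
        remove_key_name_field: true
        parse:
          type: json
  globalOutputRefs:
    - opensearch
```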
Hi guys,
it seems like we're missing the livenessProbe definition for Kibana.
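A minimal sketch of what it could look like (the path and port are Kibana's defaults; the timings are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /api/status
    port: 5601
  initialDelaySeconds: 60
  periodSeconds: 20
  failureThreshold: 3
```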
Thanks.
Bernard
Hi team, I found another incompatibility: when using logging-operated together with the velero-on-prem dr package, kustomize throws:
Error: accumulating resources: recursed merging from path '.....manifests/vendor/katalog/dr/velero/velero-on-prem': var 'MINIO_NAMESPACE' already encountered
While checking an anomaly with @FedericoAntoniazzi, we found out that the ism-policy-cronjob correctly creates new policies but fails to update existing ones, also throwing an error during its execution.
This is due to the lack of two parameters (seq_no and primary_term), which are required when updating policies.
So, to have a perfectly idempotent job, it should probably check if a policy already exists with a GET request, which returns the aforementioned parameters, and then perform a PUT with the correct querystring.
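A sketch of the idempotent flow, assuming the OpenSearch ISM API and jq being available in the job image; the endpoint, policy name, and policy.json file are placeholders:

```shell
OPENSEARCH_URL="http://opensearch:9200"  # placeholder endpoint
POLICY_ID="fury-ism-policy"              # placeholder policy name

# GET returns _seq_no and _primary_term when the policy exists
resp=$(curl -s "${OPENSEARCH_URL}/_plugins/_ism/policies/${POLICY_ID}")
seq_no=$(echo "${resp}" | jq -r '._seq_no // empty')
primary_term=$(echo "${resp}" | jq -r '._primary_term // empty')

if [ -z "${seq_no}" ]; then
  # Policy does not exist yet: a plain PUT creates it
  curl -s -X PUT -H 'Content-Type: application/json' \
    -d @policy.json "${OPENSEARCH_URL}/_plugins/_ism/policies/${POLICY_ID}"
else
  # Policy exists: the concurrency parameters are required in the querystring
  curl -s -X PUT -H 'Content-Type: application/json' \
    -d @policy.json \
    "${OPENSEARCH_URL}/_plugins/_ism/policies/${POLICY_ID}?if_seq_no=${seq_no}&if_primary_term=${primary_term}"
fi
```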
Hi!
With @Deepzima and @ettoreciarcia we discovered that by default all system indexes in OpenSearch (in the form .opendistro*) have the replicas set to 1.
This causes all our single-node OpenSearch deployments to be in a yellow state.
We fixed the issue using the following strategy:
.opendistro-ism-config and .opendistro-job-scheduler-lock
curl -H 'Content-Type: application/json' -X PUT \
-d '{
"priority": 0,
"index_patterns": [
".*"
],
"template": {
"settings": {
"index": {
"auto_expand_replicas": "0-1"
}
}
}
}' \
http://<opensearch_url>/_index_template/auto_expand_replicas
Probably this should be set together with our index patterns using the cronjob or - better - from the OpenSearch configuration file, if possible.
We have a pretty complex readinessProbe in Kibana: https://github.com/sighupio/fury-kubernetes-logging/blob/master/katalog/kibana/kibana.yml#L49
I'm facing some issues with it: Kibana is not ready, but it responds fine if I port-forward to it.
The problem is that we try to create the index patterns automatically, which causes Kibana to query Elasticsearch; this makes the readiness probe dependent on another service.
Can we remove or rethink this readinessProbe?
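One possible rethink, as a sketch: probe only Kibana's own status endpoint, so readiness no longer depends on reaching Elasticsearch (timings are assumptions):

```yaml
readinessProbe:
  httpGet:
    path: /api/status
    port: 5601
  initialDelaySeconds: 30
  periodSeconds: 10
```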
On v3.0.1 it is not clear that loki-stack is a technical preview.
Hi team, I found a nasty bug in fluent-bit 1.18.13: when a file rotates, fluent-bit has the wrong hash and the new file is not added to the tailing.
The problem is fixed in the next patch version, see:
To improve monitoring of our packages, we need to add some Prometheus alerts to the Kibana package.
fluentd is lacking a proper livenessProbe.
See
for implementation ideas.
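One minimal idea, as a sketch: check that fluentd's forward input port still answers (the port number is an assumption based on the logging operator's default fluentd input; verify it against the actual manifest):

```yaml
livenessProbe:
  tcpSocket:
    port: 24240
  initialDelaySeconds: 30
  periodSeconds: 15
```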
By default, the logging operator configures Fluentd to print logs in a file inside the container.
When doing troubleshooting activities, those logs are not available in Kibana/Loki, nor by querying them with kubectl logs or stern. The only way is to use kubectl exec to enter the pod and print the file to stdout, e.g.:
kubectl exec -it logging-demo-fluentd-0 cat /fluentd/log/out
As said in the documentation, we should add the stdout filter in the Flow and ClusterFlow resources:
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: kubernetes
  namespace: logging
spec:
  filters:
    - stdout: {}
>= 2.0.0
Metrics exposed by cerebro such as:
are not container aware and thus highly misleading (I've dug into the AngularJS and Play Scala source code to confirm this).
Another problem is that there is no way of viewing any kind of historical data regarding metrics exposed by cerebro.
My proposed solution is to enable Kibana X-Pack monitoring, "xpack.monitoring.collection.enabled" : "true", with container-aware metrics, "xpack.monitoring.ui.container.elasticsearch.enabled" : "true" (the second setting should be set to true by default in the ES docker container).
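Applied via elasticsearch.yml (or the equivalent environment variables of the ES docker image), the two settings quoted above would read:

```yaml
xpack.monitoring.collection.enabled: true
xpack.monitoring.ui.container.elasticsearch.enabled: true
```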
This change will enable the "Monitoring" tab in Kibana and will create 2 special indices with 1 shard and 1 replica:
These indices come with an Index Lifecycle Management (ILM) policy attached and are automatically deleted after 7 days.
In the basic license version you cannot set a retention that exceeds 7 days.
If a lower retention is needed, a curator job that takes into consideration that there is already an ILM policy attached to the indices will need to be created.
When PR #115 is accepted, we should backport this change to the release-v2.0 branch.
With the standard nginx parser, there are a lot of parsing errors when parsing logs from the nginx ingress controller. We need to fix/update this behaviour.
On the https://github.com/sighupio/fury-kubernetes-logging/releases/tag/v3.0.0 release description, under the upgrade guide, the reference to delete the fluentd StatefulSet and the fluentbit DaemonSet is missing.
Hello!
the Fluentbit version currently set in logging-operated, which is 1.9.10, is still affected by the following bug.
Simply put, the CRI parser will ignore and not output the time field if present: this ends up creating an index in OpenSearch with a name including a suffix equal to 1970-01-01.
The problem was solved by this commit, which was released in Fluentbit 2.0.7.
The new Kubernetes 1.24 version removes the master role in favor of the control-plane role.
The audit HostTailer should be updated accordingly.
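Concretely, wherever the audit HostTailer's pods select or tolerate master nodes, the label and taint key need to change; a generic sketch (the exact fields depend on the module's manifests):

```yaml
# Before 1.24 this targeted node-role.kubernetes.io/master
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```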
The compatibility with the dual ingress is missing in the module; thanks @sbruzzese902 for finding it.
In some cases we need to send logs to CloudWatch from fluentd.
This feature has been missing from our images since quay.io/sighup/fluentd:v1.11.5-debian-1.0.
Do you think it would be possible to have it back for the latest modules of logging v1 and v2?
Please update the CONTRIBUTING docs for versions >= 3.0.0, because Elasticsearch has been deprecated.
Verify that all the tasks for version v3 are done, check the release notes, and create a PR for releasing the new version.
The diagram should contain logging operator and opensearch + loki outputs
Dear team,
I've opened this issue with @nutellinoit just to track this bug that hit the Logging Dashboard.
Briefing:
I've installed this logging module at version 2.0.0 on an EKS cluster (v1.19.15) that has the FD at v1.5.1. Below is the list of versions of all components:
versions:
monitoring: v1.11.1
logging: v2.0.0
ingress: v1.9.1
Describe the bug:
When I try to use the new Logging Dashboard, all queries (widgets) throw an error message. Probably the default datasource in the dashboard schema is incorrect for this version of the monitoring module.
Below is a screenshot of the error:
Sometimes curator's CronJob hangs because the ES cluster is not healthy; it would be useful to know when this happens.
The issue arose due to a momentary networking issue that caused the three ES instances not to form a cluster even after the networking went back to normal.
Hi team, in the event that the new minio instance for the error output log gathering goes down, fluentd stops sending logs.
We must find a workaround, or use a different strategy for the error output.