Comments (26)

np5 commented on August 16, 2024

Hi,

  1. Check the content of /usr/local/zentral/osquery/enroll_secret.txt on the machine that you are trying to enroll. See if the serial number is correct and not shared by another machine in your fleet (see the sketch after this list).

  2. You could build an event probe to see if the enrollment events are saved by Zentral, and with which machine they are associated.
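
For step 1, a quick diagnostic sketch (macOS). How the serial number is embedded in enroll_secret.txt depends on your Zentral version, so the file is simply printed verbatim next to the machine's actual hardware serial for comparison:

import re
import subprocess

# Print the enroll secret shipped to this machine.
with open("/usr/local/zentral/osquery/enroll_secret.txt") as f:
    print("enroll secret:", f.read().strip())

# Print the machine's actual hardware serial number for comparison.
out = subprocess.check_output(
    ["ioreg", "-c", "IOPlatformExpertDevice", "-d", "2"]
).decode()
match = re.search(r'"IOPlatformSerialNumber" = "([^"]+)"', out)
print("hardware serial:", match.group(1) if match else "not found")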

erdarun commented on August 16, 2024

Hi @np5

Yes, you are spot on. We had 2 machines with the same hardware serial number, which caused the issue. Thanks.

I can sometimes see Postgres errors like the ones below in the db log. Will they have an impact?

ERROR:  duplicate key value violates unique constraint "inventory_machinesnapshotcommit_serial_number_f9fee8fa_uniq"
DETAIL:  Key (serial_number, source_id, version)=(0123456789, 1, 22184) already exists.
STATEMENT:  INSERT INTO "inventory_machinesnapshotcommit" ("serial_number", "source_id", "version", "machine_snapshot_id", "parent_id", "last_seen", "system_uptime", "created_at") VALUES ('0123456789', 1, 22184, 1, 53900, '2017-08-31T07:43:28.426620'::timestamp, NULL, '2017-08-31T07:43:28.446949'::timestamp) RETURNING "inventory_machinesnapshotcommit"."id"

LOG:  could not receive data from client: Connection reset by peer
LOG:  could not receive data from client: Connection reset by peer

np5 commented on August 16, 2024

I defensively decided to catch the postgres error in the code, in case there was a race to save the same inventory data. See here. At the time, I did not think it could happen, but apparently it does. Could you tell me how often you see those lines?
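
For illustration, the pattern looks roughly like this. A minimal sketch, not Zentral's actual code (see the linked commit for that); the model name is taken from the log above and is assumed to be imported:

from django.db import IntegrityError, transaction

def commit_machine_snapshot(serial_number, source_id, version, **fields):
    # Insert a snapshot commit; if a concurrent worker already inserted the
    # same (serial_number, source_id, version) tuple, reuse the existing row.
    try:
        with transaction.atomic():
            return MachineSnapshotCommit.objects.create(
                serial_number=serial_number,
                source_id=source_id,
                version=version,
                **fields,
            )
    except IntegrityError:
        # Lost the race: fall back to the row the other worker committed.
        return MachineSnapshotCommit.objects.get(
            serial_number=serial_number, source_id=source_id, version=version
        )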

About the "Connection reset by peer", it depends on your docker startup script, the connections between the different containers and the db…

erdarun commented on August 16, 2024

Hi @np5

Zentral has been running for the past 4 days.

'Connection reset by peer' has been observed twice, some time after service startup on the same day.

The duplicate key violation for serial number '9876543210' has been observed many times, more than 10 times in the past 4 days.

np5 commented on August 16, 2024

Hi,

This serial number looks like the one used by the dummy inventory module. You should deactivate it. But before you do, could you tell me how many Zentral workers you have, filter the inventory events for this "machine", and find the rate at which they are coming in?

erdarun commented on August 16, 2024

Hi @np5

How do I deactivate the dummy inventory modules "0123456789" and "9876543210"? By archiving the existing enrolled machines?

I have 2 instances of the Zentral workers running. The rate is 2 events per minute.

np5 commented on August 16, 2024

Hi,

If you archive the machines, they will come back, because the dummy inventory client is going to simulate update events.

Edit base.json and remove the dummy client entry from the inventory clients list:

"zentral.contrib.inventory": {
  "prometheus_bearer_token": "CHANGE ME!!!",
  "clients": [
    {
      "backend": "zentral.contrib.inventory.clients.dummy"
    }
  ]
}

Then you can archive the machines.

erdarun commented on August 16, 2024

Hi @np5

Thanks.

Is there a way in Zentral to create multiple file paths in one go in probes? We have a requirement for 3000 file paths per probe, so creating them manually one by one is time consuming.

Also, is it true that we can't add exclude paths in probes?

np5 commented on August 16, 2024

I would advise you to generate a JSON probe feed based on a simple, editable input containing the 3000 paths. This JSON probe feed can then be imported into Zentral. You can even update the probe after you have updated the feed.

erdarun commented on August 16, 2024

Hi @np5

Do we have an example of a valid JSON probe feed with the expected structure, and under which option in Zentral should it be imported?

np5 commented on August 16, 2024

Start by building a probe with the Zentral GUI, then export it as a gist. You will get a valid Zentral feed with the probe inside. Modify this JSON structure to suit your needs, then host it somewhere, so that Zentral can download it. You can then import it as a feed.
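
For the 3000 file paths, a minimal generation sketch. The feed structure below is an assumption: take the actual key names (including the probe key, here the hypothetical "my-fim-probe") from your own exported feed:

import json

# Load a feed exported from the Zentral GUI to use as a template.
with open("exported_feed.json") as f:
    feed = json.load(f)

# One path per line, ~3000 lines.
with open("file_paths.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

# Replace the probe's path list. "probes", "my-fim-probe" and "body" are
# hypothetical key names -- mirror the structure of your export.
feed["probes"]["my-fim-probe"]["body"]["file_paths"] = paths

with open("generated_feed.json", "w") as f:
    json.dump(feed, f, indent=2)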

erdarun commented on August 16, 2024

@np5 Thanks, let me try it out.

We have 700+ machines enrolled so far, but the events posted by all these machines take more than 15 minutes to be stored in Elasticsearch or to show up in a Kibana search.

Maybe the Elasticsearch cluster (4 nodes) is slow at processing events; the RabbitMQ store_elasticsearch queue grows to hundreds of thousands of messages within half an hour.

Postgres throws a max connections error. We are receiving osquery log/distributed query/config requests most of the time, with log being the most frequent. Do all these events require a DB connection to be opened and closed?

FATAL:  sorry, too many clients already

We can observe a lot of "idle in transaction" connections at the DB for the query below, which causes connections not to be released by Zentral:

SELECT "inventory_machinesnapshot"."id", "inventory_machinesnapshot"."mt_hash", "inve
ntory_machinesnapshot"."mt_created_at", "inventory_machinesnapshot"."source_id", "inventory_machinesnapshot"."reference", "inventory_machinesnapshot"."serial_number", "inventory_machinesnapshot"."business
_unit_id", "inventory_machinesnapshot"."os_version_id", "inventory_machinesnapshot"."platform", "inventory_machinesnapshot"."system_info_id", "inventory_machinesnapshot"."type", "inventory_machinesnapshot
"."teamviewer_id", "inventory_machinesnapshot"."puppet_node_id", "inventory_machinesnapshot"."public_ip_address", "inventory_businessunit"."id", "inventory_businessunit"."mt_hash", "inventory_businessunit
"."mt_created_at", "inventory_businessunit"."source_id", "inventory_businessunit"."reference", "inventory_businessunit"."key", "inventory_businessunit"."name", "inventory_businessunit"."meta_business_unit
_id", "inventory_metabusinessunit"."id", "inventory_metabusinessunit"."name", "inventory_metabusinessunit"."created_at", "...

Also, the RabbitMQ component closes/accepts connections many times after it has been running for a while. Do we need to create a RabbitMQ cluster for load balancing?

>> rabbitmqctl list_queues
Listing queues
process_events	2189780
amq.gen-lne4zEL1Fz2g2JG_jxitxA	0
amq.gen-MvMLHZ58xQt2MF_rNYP8xg	0
amq.gen-SUuy4o_9SO6ZkGThCdfbjA	0
amq.gen-JHca_UDGKIGTUI8LBpOGQQ	0
store_events_elasticsearch	3763244
=ERROR REPORT==== 7-Sep-2017::12:47:31 ===
closing AMQP connection <0.598.0> (10.0.0.23:56716 -> 10.0.0.9:5672):
{writer,send_failed,{error,timeout}}

=INFO REPORT==== 7-Sep-2017::12:48:58 ===
accepting AMQP connection <0.647.0> (10.0.0.23:59638 -> 10.0.0.9:5672)

=INFO REPORT==== 7-Sep-2017::12:48:58 ===
connection <0.647.0> (10.0.0.23:59638 -> 10.0.0.9:5672): user 'guest' authenticated and granted access to vhost '/'

np5 commented on August 16, 2024

Postgres: Try with CONN_MAX_AGE in the Django DB settings.

Queue: Use the rabbitmq_exporter to export the rabbitmq metrics for prometheus. You will then see how fast you are producing events.
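
For reference, CONN_MAX_AGE is a standard Django database option. A minimal settings sketch; the connection details are placeholders, and in a Zentral deployment the Django settings are derived from the JSON configuration, so apply this wherever DATABASES is defined:

# Django settings sketch -- persistent database connections.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "zentral",   # placeholder connection details
        "HOST": "db",
        # Reuse each connection for 60s. 0 closes after every request,
        # None keeps connections open indefinitely.
        "CONN_MAX_AGE": 60,
    }
}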

erdarun commented on August 16, 2024

Hi @np5

For CONN_MAX_AGE, I tried with a None value (unlimited lifetime) and with a small integer value as well, but it hasn't helped much. Even then, the Postgres DB fills up with "idle in transaction" connections for the MachineSnapshot query. I have explicitly timed out "idle in transaction" connections, but then idle transactions without any query or COMMIT statement pile up to max connections, which is weird, and the database goes down.

I will give rabbitmq_exporter a try, but as observed using "rabbitmqctl list_queues", the "store_events_elasticsearch" queue grows to hundreds of thousands of messages within 10 minutes.

Is there any solution to support large traffic from osquery, say 1000 clients hitting Zentral? As far as I can see, each enroll/config/log API call needs a database check for the latest configs/hardware serial and machine snapshot. Is it cached using @cached_property?

Also, the Postgres MachineSnapshotCommit table has 33000 records, where the same machines have been enrolled multiple times with different versions. Is that required?

I have archived the dummy clients, but now I am receiving duplicate key value violates unique constraint "inventory_machinesnapshotcommit_serial_number_f9fee8fa_uniq" errors for existing enrolled machines.

np5 commented on August 16, 2024

How many total Zentral processes are you running? Gunicorn processes + all the workers with their child processes.

One of the standard simple deployments of Zentral is with everything (Postgres, Elasticsearch, Gunicorn, Nginx, Kibana, RabbitMQ) on one 8GB VPS. With this configuration, without having to tweak anything, ~30 events per second can be saved and > 1000 machines kept in the inventory, with multiple inventory sources per machine.

erdarun commented on August 16, 2024

We are running 2 Zentral instances. We are not running gunicorn processes; instead we are using runserver. Do we need to change that to gunicorn for performance improvements?

The frequency of osquery events is 65000+ events published to Elasticsearch every 10 minutes, but the catch is that, in parallel, the Elasticsearch store queue at RabbitMQ grows to hundreds of thousands of waiting events.

np5 commented on August 16, 2024

Note that you can run an extra store worker without adding all the other workers. See the runworkers command. With docker-entrypoint.py, you can invoke it like this:

$ docker-compose run --rm workers runworkers --list-workers

to get a list of all the configured workers run by default (1 proc each).

To run an extra store worker, without adding all the extra workers (thus saving on postgres connections):

$ docker-compose run --rm workers runworkers 'store worker elasticsearch'

This is what happened after starting 2 extra processor workers and 2 extra store workers on the 8GB VPS (tested today with a lot of audit events):

[screenshot taken 2017-09-12 at 23:05:16]

The processor worker scales pretty well; the store worker is probably blocked by the Elasticsearch server.

After all the events had been processed, I removed the extra processor workers and added one store worker. It scales pretty well too when the server is under less load:

[screenshot taken 2017-09-13 at 00:09:43]

I am not saying that the performance of the workers is great. There are many things that could be optimized, and the caching of the machine info would be one of them. But it can already scale by adding some specific workers.
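
To illustrate the kind of machine info caching meant here (a hypothetical sketch, not Zentral's implementation): a small per-serial TTL cache in front of the snapshot lookup would save one postgres round trip per osquery request.

import time

_TTL = 300    # seconds to consider a cached entry fresh
_cache = {}   # serial_number -> (expires_at, machine_info)

def get_machine_info(serial_number, load_from_db):
    # Return cached machine info, falling back to the expensive DB lookup.
    now = time.time()
    entry = _cache.get(serial_number)
    if entry and entry[0] > now:
        return entry[1]
    info = load_from_db(serial_number)  # e.g. the MachineSnapshot query
    _cache[serial_number] = (now + _TTL, info)
    return info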

np5 commented on August 16, 2024

About gunicorn:

runserver is not meant for production use:

DO NOT USE THIS SERVER IN A PRODUCTION SETTING. It has not gone through security audits or performance tests. (And that’s how it’s gonna stay. We’re in the business of making Web frameworks, not Web servers, so improving this server to be able to handle a production environment is outside the scope of Django.)

I don't think it is the core of your problem here. You seem to be able to get most of the events sent by the clients. But do use a real web application server like gunicorn.

Another important thing:

Django DEBUG settings

Never deploy a site into production with DEBUG turned on.

Did you catch that? NEVER deploy a site into production with DEBUG turned on.

Among the things that DEBUG turns on is the logging of all postgres queries. Another: a debug page on all 40x/50x errors. Not something you want to have in prod.

You can change this in base.json.

erdarun commented on August 16, 2024

Thanks @np5 for the detailed description.

gunicorn solves the max connections issue. It's able to serve our load for now.

Yes, DEBUG was already false in production.

We are running 2 instances of runworkers in docker swarm. As you have suggested, we may revisit the optimum number of store workers/processors.

np5 commented on August 16, 2024

So the only problem left is the duplicate key violation? Could you try to find out the inventory source, whether it concerns a lot of machines or only a few, and how often it happens?

We have this issue with some machines running Santa. We update the inventory on Santa preflights, but sometimes the preflight calls happen more than 10 times in a row, triggering more than 10 concurrent inventory updates, and thus duplicate key violations (same snapshot). Thanks to the integrity checks, this is not a critical issue, just something we should avoid. Still investigating.

erdarun commented on August 16, 2024

Also, we are getting frequent warnings like the ones below:

base_extras WARNING App zentral.contrib.jamf w/o main menu config
base_extras WARNING App zentral.contrib.simplemdm w/o main menu config
base_extras WARNING App zentral.contrib.munki w/o main menu config
base_extras WARNING App zentral.contrib.nagios w/o main menu config
base_extras WARNING App zentral.contrib.osquery w/o main menu config
base_extras WARNING App zentral.contrib.santa w/o main menu config
base_extras WARNING App zentral.contrib.inventory w/o setup menu config
base_extras WARNING App zentral.core.probes w/o setup menu config

np5 commented on August 16, 2024

Maybe it should not be logged at the WARNING level, but at the INFO level. (done in e33eac4).

The template tag used to build the main menu looks for a main_menu_config in the urls.py files of the active Zentral modules. Some of them have one, and a menu is displayed (like monolith). Some of them don't.
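
As a purely hypothetical illustration of that mechanism (the exact shape of the object is defined by Zentral and not shown in this thread), a module's urls.py might look like:

# some_module/urls.py -- hypothetical sketch
from django.conf.urls import url

urlpatterns = [
    # ... the module's views ...
]

# The presence of this object is what the main menu template tag looks for;
# the keys below are invented for illustration.
main_menu_config = {
    "title": "Monolith",
    "items": [("catalog_list", "Catalogs"), ("manifest_list", "Manifests")],
}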

This issue is epic…

erdarun commented on August 16, 2024

Also, we are facing a timeout issue with the gunicorn workers, maybe due to an Elasticsearch index recovery running in parallel; I have increased the worker timeout to 90 seconds for now. The Elasticsearch index recovery history shows one recovery that took 36 seconds, and by default gunicorn workers time out after 30 seconds. Is Elasticsearch the root cause of the worker timeouts?

[CRITICAL] WORKER TIMEOUT (pid:184)
[CRITICAL] WORKER TIMEOUT (pid:168)
[CRITICAL] WORKER TIMEOUT (pid:169)
[CRITICAL] WORKER TIMEOUT (pid:170)
[CRITICAL] WORKER TIMEOUT (pid:181)
[CRITICAL] WORKER TIMEOUT (pid:182)

[INFO] Booting worker with pid: 193
[INFO] Booting worker with pid: 194
[INFO] Booting worker with pid: 195
[INFO] Booting worker with pid: 181
[INFO] Booting worker with pid: 182
[INFO] Booting worker with pid: 183

elasticsearch WARNING Elasticsearch index recovering
elasticsearch WARNING Elasticsearch index recovering
elasticsearch WARNING Elasticsearch index recovering
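
For reference, the timeout increase mentioned above can be expressed in a gunicorn Python config file. A minimal sketch; the bind address, worker count and WSGI module path are placeholder assumptions for this example:

# gunicorn.conf.py -- start with: gunicorn -c gunicorn.conf.py server.wsgi
bind = "0.0.0.0:8000"
workers = 4    # rule of thumb: 2 * CPU cores + 1
timeout = 90   # raised from the 30s default to ride out slow Elasticsearch recoveries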

Additional warnings, seen multiple times:

base WARNING Not Found: /..\pixfir~1\how_to_login.html
base WARNING Not Found: /web-console/ServerInfo.jsp
base WARNING Not Found: /..\pixfir~1\how_to_login.html
base WARNING Not Found: /invoker/EJBInvokerServlet
base WARNING Not Found: /invoker/JMXInvokerServlet
base WARNING Not Found: /note.txt
base WARNING Not Found: /axis-cgi/operator/param.cgi
base WARNING Not Found: /sapmc/sapmc.html

np5 commented on August 16, 2024

erdarun commented on August 16, 2024

@np5 Do you mean an intrusion?

np5 commented on August 16, 2024

Well, this is obviously a web crawler trying to find out what is running on your Zentral server.
