Comments (26)

np5 commented on August 16, 2024

Hi,

  1. Check the content of /usr/local/zentral/osquery/enroll_secret.txt on the machine that you are trying to enroll. See if the serial number is correct and not shared by another machine in your fleet (see the sketch after this list).

  2. You could build an event probe to see if the enrollment events are saved by Zentral, and with which machine they are associated.
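
For step 1, a quick diagnostic sketch (macOS). How the serial number is embedded in enroll_secret.txt depends on your Zentral version, so the file is simply printed verbatim next to the machine's actual hardware serial for comparison:

import re
import subprocess

# Print the enroll secret shipped to this machine.
with open("/usr/local/zentral/osquery/enroll_secret.txt") as f:
    print("enroll secret:", f.read().strip())

# Print the machine's actual hardware serial number for comparison.
out = subprocess.check_output(
    ["ioreg", "-c", "IOPlatformExpertDevice", "-d", "2"]
).decode()
match = re.search(r'"IOPlatformSerialNumber" = "([^"]+)"', out)
print("hardware serial:", match.group(1) if match else "not found")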

erdarun commented on August 16, 2024

Hi @np5

Yes, you are spot on. We had 2 machines with the same hardware serial number, which caused the issue. Thanks.

I can sometimes see Postgres errors like the ones below in the db log. Will they have an impact?

ERROR:  duplicate key value violates unique constraint "inventory_machinesnapshotcommit_serial_number_f9fee8fa_uniq"
DETAIL:  Key (serial_number, source_id, version)=(0123456789, 1, 22184) already exists.
STATEMENT:  INSERT INTO "inventory_machinesnapshotcommit" ("serial_number", "source_id", "version", "machine_snapshot_id", "parent_id", "last_seen", "system_uptime", "created_at") VALUES ('0123456789', 1, 22184, 1, 53900, '2017-08-31T07:43:28.426620'::timestamp, NULL, '2017-08-31T07:43:28.446949'::timestamp) RETURNING "inventory_machinesnapshotcommit"."id"

LOG:  could not receive data from client: Connection reset by peer
LOG:  could not receive data from client: Connection reset by peer

np5 commented on August 16, 2024

I defensively decided to catch the postgres error in the code, in case there was a race to save the same inventory data. See here. At the time, I did not think it could happen, but apparently it does. Could you tell me how often you see those lines?
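
For illustration, the pattern looks roughly like this. A minimal sketch, not Zentral's actual code (see the linked commit for that); the model name is taken from the log above and is assumed to be imported:

from django.db import IntegrityError, transaction

def commit_machine_snapshot(serial_number, source_id, version, **fields):
    # Insert a snapshot commit; if a concurrent worker already inserted the
    # same (serial_number, source_id, version) tuple, reuse the existing row.
    try:
        with transaction.atomic():
            return MachineSnapshotCommit.objects.create(
                serial_number=serial_number,
                source_id=source_id,
                version=version,
                **fields,
            )
    except IntegrityError:
        # Lost the race: fall back to the row the other worker committed.
        return MachineSnapshotCommit.objects.get(
            serial_number=serial_number, source_id=source_id, version=version
        )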

About the "Connection reset by peer", it depends on your docker startup script, the connections between the different containers and the db…

erdarun commented on August 16, 2024

Hi @np5

Zentral has been running for the past 4 days.

'Connection reset by peer' has been observed twice, some time after service startup on the same day.

The duplicate key violation for serial number '9876543210' has been observed many times, more than 10 times in the past 4 days.

np5 commented on August 16, 2024

Hi,

This serial number looks like the one used by the dummy inventory module. You should deactivate it. But before you do, could you tell me how many Zentral workers you have, filter the inventory events for this "machine", and find the rate at which they are coming in?

erdarun commented on August 16, 2024

Hi @np5

How do I deactivate the dummy inventory modules "0123456789" and "9876543210"? By archiving the existing enrolled machines?

I have 2 instances of the Zentral workers running. The rate is 2 events per minute.

np5 commented on August 16, 2024

Hi,

If you archive the machines, they will come back, because the dummy inventory client is going to simulate update events.

Edit base.json and remove the dummy client entry from the inventory clients list:

"zentral.contrib.inventory": {
  "prometheus_bearer_token": "CHANGE ME!!!",
  "clients": [
    {
      "backend": "zentral.contrib.inventory.clients.dummy"
    }
  ]
}

Then you can archive the machines.

erdarun commented on August 16, 2024

Hi @np5

Thanks.

Is there a way in Zentral to create multiple file paths in one go in probes? We have a requirement for 3000 file paths per probe, so creating them manually one by one is time consuming.

Also, is it true that we can't add exclude paths in probes?

np5 commented on August 16, 2024

I would advise you to generate a JSON probe feed based on a simple, editable input containing the 3000 paths. This JSON probe feed can then be imported into Zentral. You can even update the probe after you have updated the feed.

erdarun commented on August 16, 2024

Hi @np5

Do we have an example of a valid JSON probe feed with the expected structure, and under which option in Zentral should it be imported?

np5 commented on August 16, 2024

Start by building a probe with the Zentral GUI, then export it as a gist. You will get a valid Zentral feed with the probe inside. Modify this JSON structure to suit your needs, then host it somewhere, so that Zentral can download it. You can then import it as a feed.
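
For the 3000 file paths, a minimal generation sketch. The feed structure below is an assumption: take the actual key names (including the probe key, here the hypothetical "my-fim-probe") from your own exported feed:

import json

# Load a feed exported from the Zentral GUI to use as a template.
with open("exported_feed.json") as f:
    feed = json.load(f)

# One path per line, ~3000 lines.
with open("file_paths.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

# Replace the probe's path list. "probes", "my-fim-probe" and "body" are
# hypothetical key names -- mirror the structure of your export.
feed["probes"]["my-fim-probe"]["body"]["file_paths"] = paths

with open("generated_feed.json", "w") as f:
    json.dump(feed, f, indent=2)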

erdarun commented on August 16, 2024

@np5 Thanks, let me try it out.

We have 700+ machines enrolled so far, but the events posted by all these machines take more than 15 minutes to be stored in Elasticsearch or to show up in a Kibana search.

Maybe the Elasticsearch cluster (4 nodes) is slow at processing events; the RabbitMQ store_elasticsearch queue grows to hundreds of thousands of messages within half an hour.

Postgres throws a max connections error. We are receiving osquery log/distributed query/config requests most of the time, with log being the most frequent. Do all these events require a DB connection to be opened and closed?

FATAL:  sorry, too many clients already

We can observe a lot of "idle in transaction" connections at the DB for the query below, which causes connections not to be released by Zentral:

SELECT "inventory_machinesnapshot"."id", "inventory_machinesnapshot"."mt_hash", "inve
ntory_machinesnapshot"."mt_created_at", "inventory_machinesnapshot"."source_id", "inventory_machinesnapshot"."reference", "inventory_machinesnapshot"."serial_number", "inventory_machinesnapshot"."business
_unit_id", "inventory_machinesnapshot"."os_version_id", "inventory_machinesnapshot"."platform", "inventory_machinesnapshot"."system_info_id", "inventory_machinesnapshot"."type", "inventory_machinesnapshot
"."teamviewer_id", "inventory_machinesnapshot"."puppet_node_id", "inventory_machinesnapshot"."public_ip_address", "inventory_businessunit"."id", "inventory_businessunit"."mt_hash", "inventory_businessunit
"."mt_created_at", "inventory_businessunit"."source_id", "inventory_businessunit"."reference", "inventory_businessunit"."key", "inventory_businessunit"."name", "inventory_businessunit"."meta_business_unit
_id", "inventory_metabusinessunit"."id", "inventory_metabusinessunit"."name", "inventory_metabusinessunit"."created_at", "...

Also, the RabbitMQ component closes/accepts connections many times after it has been running for a while. Do we need to create a RabbitMQ cluster for load balancing?

>> rabbitmqctl list_queues
Listing queues
process_events	2189780
amq.gen-lne4zEL1Fz2g2JG_jxitxA	0
amq.gen-MvMLHZ58xQt2MF_rNYP8xg	0
amq.gen-SUuy4o_9SO6ZkGThCdfbjA	0
amq.gen-JHca_UDGKIGTUI8LBpOGQQ	0
store_events_elasticsearch	3763244
=ERROR REPORT==== 7-Sep-2017::12:47:31 ===
closing AMQP connection <0.598.0> (10.0.0.23:56716 -> 10.0.0.9:5672):
{writer,send_failed,{error,timeout}}

=INFO REPORT==== 7-Sep-2017::12:48:58 ===
accepting AMQP connection <0.647.0> (10.0.0.23:59638 -> 10.0.0.9:5672)

=INFO REPORT==== 7-Sep-2017::12:48:58 ===
connection <0.647.0> (10.0.0.23:59638 -> 10.0.0.9:5672): user 'guest' authenticated and granted access to vhost '/'

np5 commented on August 16, 2024

Postgres: Try with CONN_MAX_AGE in the Django DB settings.

Queue: Use the rabbitmq_exporter to export the rabbitmq metrics for prometheus. You will then see how fast you are producing events.
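
For reference, CONN_MAX_AGE is a standard Django database option. A minimal settings sketch; the connection details are placeholders, and in a Zentral deployment the Django settings are derived from the JSON configuration, so apply this wherever DATABASES is defined:

# Django settings sketch -- persistent database connections.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "zentral",   # placeholder connection details
        "HOST": "db",
        # Reuse each connection for 60s. 0 closes after every request,
        # None keeps connections open indefinitely.
        "CONN_MAX_AGE": 60,
    }
}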

erdarun commented on August 16, 2024

Hi @np5

For CONN_MAX_AGE, I tried with a None value (unlimited lifetime) and with a small integer value as well, but it hasn't helped much. Even then, the Postgres DB fills up with "idle in transaction" connections for the MachineSnapshot query. I have explicitly timed out "idle in transaction" connections, but then idle transactions without any query or COMMIT statement pile up to max connections, which is weird, and the database goes down.

I will give rabbitmq_exporter a try, but as observed using "rabbitmqctl list_queues", the "store_events_elasticsearch" queue grows to hundreds of thousands of messages within 10 minutes.

Is there any solution to support large traffic from osquery, say 1000 clients hitting Zentral? As far as I can see, each enroll/config/log API call needs a database check for the latest configs/hardware serial and machine snapshot. Is it cached using @cached_property?

Also, the Postgres MachineSnapshotCommit table has 33000 records, where the same machines have been enrolled multiple times with different versions. Is that required?

I have archived the dummy clients, but now I am receiving duplicate key value violates unique constraint "inventory_machinesnapshotcommit_serial_number_f9fee8fa_uniq" errors for existing enrolled machines.

np5 commented on August 16, 2024

How many total Zentral processes are you running? Gunicorn processes + all the workers with their child processes.

One of the standard simple deployments of Zentral is with everything (Postgres, Elasticsearch, Gunicorn, Nginx, Kibana, RabbitMQ) on one 8GB VPS. With this configuration, without having to tweak anything, ~30 events per second can be saved and > 1000 machines kept in the inventory, with multiple inventory sources per machine.

erdarun commented on August 16, 2024

We are running 2 Zentral instances. We are not running gunicorn processes; instead we are using runserver. Do we need to change that to gunicorn for performance improvements?

The frequency of osquery events is 65000+ events published to Elasticsearch every 10 minutes, but the catch is that, in parallel, the Elasticsearch store queue at RabbitMQ grows to hundreds of thousands of waiting events.

np5 commented on August 16, 2024

Note that you can run an extra store worker without adding all the other workers. See the runworkers command. With docker-entrypoint.py, you can invoke it like this:

$ docker-compose run --rm workers runworkers --list-workers

to get a list of all the configured workers run by default (1 proc each).

To run an extra store worker, without adding all the extra workers (thus saving on postgres connections):

$ docker-compose run --rm workers runworkers 'store worker elasticsearch'

This is what happened after starting 2 extra processor workers and 2 extra store workers on the 8GB VPS (tested today with a lot of audit events):

[screenshot taken 2017-09-12 at 23:05:16]

The processor worker scales pretty well; the store worker is probably blocked by the Elasticsearch server.

After all the events had been processed, I removed the extra processor workers and added one store worker. It scales pretty well too when the server is under less load:

[screenshot taken 2017-09-13 at 00:09:43]

I am not saying that the performance of the workers is great. There are many things that could be optimized, and the caching of the machine info would be one of them. But it can already scale by adding some specific workers.
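
To illustrate the kind of machine info caching meant here (a hypothetical sketch, not Zentral's implementation): a small per-serial TTL cache in front of the snapshot lookup would save one postgres round trip per osquery request.

import time

_TTL = 300    # seconds to consider a cached entry fresh
_cache = {}   # serial_number -> (expires_at, machine_info)

def get_machine_info(serial_number, load_from_db):
    # Return cached machine info, falling back to the expensive DB lookup.
    now = time.time()
    entry = _cache.get(serial_number)
    if entry and entry[0] > now:
        return entry[1]
    info = load_from_db(serial_number)  # e.g. the MachineSnapshot query
    _cache[serial_number] = (now + _TTL, info)
    return info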

np5 commented on August 16, 2024

About gunicorn:

runserver is not meant for production use:

DO NOT USE THIS SERVER IN A PRODUCTION SETTING. It has not gone through security audits or performance tests. (And that’s how it’s gonna stay. We’re in the business of making Web frameworks, not Web servers, so improving this server to be able to handle a production environment is outside the scope of Django.)

I don't think it is the core of your problem here. You seem to be able to get most of the events sent by the clients. But do use a real web application server like gunicorn.

Another important thing:

Django DEBUG settings

Never deploy a site into production with DEBUG turned on.

Did you catch that? NEVER deploy a site into production with DEBUG turned on.

Among the things that DEBUG turns on is the logging of all postgres queries. Another: a debug page on all 40x/50x errors. Not something you want to have in prod.

You can change this in base.json.

erdarun commented on August 16, 2024

Thanks @np5 for the detailed description.

gunicorn solves the max connections issue. It's able to serve our load for now.

Yes, DEBUG was already false in production.

We are running 2 instances of runworkers in docker swarm. As you have suggested, we may revisit the optimum number of store workers/processors.

np5 commented on August 16, 2024

So the only problem left is the duplicate key violation? Could you try to find out the inventory source, whether it concerns a lot of machines or only a few, and how often it happens?

We have this issue with some machines running Santa. We update the inventory on Santa preflights, but sometimes the preflight calls happen more than 10 times in a row, triggering more than 10 concurrent inventory updates, and thus duplicate key violations (same snapshot). Thanks to the integrity checks, this is not a critical issue, just something we should avoid. Still investigating.

erdarun commented on August 16, 2024

Also, we are getting frequent warnings like the ones below:

base_extras WARNING App zentral.contrib.jamf w/o main menu config
base_extras WARNING App zentral.contrib.simplemdm w/o main menu config
base_extras WARNING App zentral.contrib.munki w/o main menu config
base_extras WARNING App zentral.contrib.nagios w/o main menu config
base_extras WARNING App zentral.contrib.osquery w/o main menu config
base_extras WARNING App zentral.contrib.santa w/o main menu config
base_extras WARNING App zentral.contrib.inventory w/o setup menu config
base_extras WARNING App zentral.core.probes w/o setup menu config

np5 commented on August 16, 2024

Maybe it should not be logged at the WARNING level, but at the INFO level. (done in e33eac4).

The template tag used to build the main menu looks for a main_menu_config in the urls.py files of the active Zentral modules. Some of them have one, and a menu is displayed (like monolith). Some of them don't.
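
As a purely hypothetical illustration of that mechanism (the exact shape of the object is defined by Zentral and not shown in this thread), a module's urls.py might look like:

# some_module/urls.py -- hypothetical sketch
from django.conf.urls import url

urlpatterns = [
    # ... the module's views ...
]

# The presence of this object is what the main menu template tag looks for;
# the keys below are invented for illustration.
main_menu_config = {
    "title": "Monolith",
    "items": [("catalog_list", "Catalogs"), ("manifest_list", "Manifests")],
}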

This issue is epic…

erdarun commented on August 16, 2024

Also, we are facing a timeout issue with the gunicorn workers, maybe due to an Elasticsearch index recovery running in parallel; I have increased the worker timeout to 90 seconds for now. The Elasticsearch index recovery history shows one recovery that took 36 seconds, and by default gunicorn workers time out after 30 seconds. Is Elasticsearch the root cause of the worker timeouts?

[CRITICAL] WORKER TIMEOUT (pid:184)
[CRITICAL] WORKER TIMEOUT (pid:168)
[CRITICAL] WORKER TIMEOUT (pid:169)
[CRITICAL] WORKER TIMEOUT (pid:170)
[CRITICAL] WORKER TIMEOUT (pid:181)
[CRITICAL] WORKER TIMEOUT (pid:182)

[INFO] Booting worker with pid: 193
[INFO] Booting worker with pid: 194
[INFO] Booting worker with pid: 195
[INFO] Booting worker with pid: 181
[INFO] Booting worker with pid: 182
[INFO] Booting worker with pid: 183

elasticsearch WARNING Elasticsearch index recovering
elasticsearch WARNING Elasticsearch index recovering
elasticsearch WARNING Elasticsearch index recovering
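
For reference, the timeout increase mentioned above can be expressed in a gunicorn Python config file. A minimal sketch; the bind address, worker count and WSGI module path are placeholder assumptions for this example:

# gunicorn.conf.py -- start with: gunicorn -c gunicorn.conf.py server.wsgi
bind = "0.0.0.0:8000"
workers = 4    # rule of thumb: 2 * CPU cores + 1
timeout = 90   # raised from the 30s default to ride out slow Elasticsearch recoveries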

Additional warnings, seen multiple times:

base WARNING Not Found: /..\pixfir~1\how_to_login.html
base WARNING Not Found: /web-console/ServerInfo.jsp
base WARNING Not Found: /..\pixfir~1\how_to_login.html
base WARNING Not Found: /invoker/EJBInvokerServlet
base WARNING Not Found: /invoker/JMXInvokerServlet
base WARNING Not Found: /note.txt
base WARNING Not Found: /axis-cgi/operator/param.cgi
base WARNING Not Found: /sapmc/sapmc.html

np5 commented on August 16, 2024

erdarun commented on August 16, 2024

@np5 Do you mean an intrusion?

np5 commented on August 16, 2024

Well, this is obviously a web crawler trying to find out what is running on your Zentral server.
