Comments (26)
Hi,
- Check the content of /usr/local/zentral/osquery/enroll_secret.txt on the machine that you are trying to enroll. See if the serial number is correct and not shared by another machine in your fleet.
- You could build an event probe to see if the enrollment events are saved by Zentral, and which machines they are associated with.
from zentral.
Hi @np5
Yes, you are spot on. We have 2 machines with the same hardware serial number, which caused the issue. Thanks.
I can sometimes see Postgres DB log errors like the ones below. Will they have an impact?
ERROR: duplicate key value violates unique constraint "inventory_machinesnapshotcommit_serial_number_f9fee8fa_uniq"
DETAIL: Key (serial_number, source_id, version)=(0123456789, 1, 22184) already exists.
STATEMENT: INSERT INTO "inventory_machinesnapshotcommit" ("serial_number", "source_id", "version", "machine_snapshot_id", "parent_id", "last_seen", "system_uptime", "created_at") VALUES ('0123456789', 1, 22184, 1, 53900, '2017-08-31T07:43:28.426620'::timestamp, NULL, '2017-08-31T07:43:28.446949'::timestamp) RETURNING "inventory_machinesnapshotcommit"."id"
LOG: could not receive data from client: Connection reset by peer
LOG: could not receive data from client: Connection reset by peer
I defensively decided to catch the postgres error in the code, in case there was a race to save the same inventory data. See here. At the time, I did not think it could happen, but apparently it does. Could you tell me how often you see those lines?
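For illustration only (this is not Zentral's actual code, which goes through Django and Postgres), the shape of that defensive catch can be sketched with sqlite3: two workers race to insert the same snapshot commit, and the loser simply tolerates the constraint violation.

```python
import sqlite3

def insert_commit(conn, serial_number, source_id, version):
    """Sketch of tolerating a duplicate-key race: if another worker already
    committed the same (serial_number, source_id, version) row, swallow the
    integrity error instead of crashing."""
    try:
        with conn:  # commit on success, rollback on error
            conn.execute(
                "INSERT INTO snapshot_commit (serial_number, source_id, version)"
                " VALUES (?, ?, ?)",
                (serial_number, source_id, version),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # same snapshot already committed by another worker

# Minimal schema mirroring the unique constraint from the Postgres log above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshot_commit ("
    " serial_number TEXT, source_id INTEGER, version INTEGER,"
    " UNIQUE (serial_number, source_id, version))"
)
```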
About the "Connection reset by peer": it depends on your docker startup script, and on the connections between the different containers and the DB…
Hi @np5
Zentral has been running for the past 4 days.
'Connection reset by peer' was observed twice, some time after service startup, on the same day.
The duplicate key violation for serial number '9876543210' has been observed more than 10 times in the past 4 days.
Hi,
This serial number looks like the one used by the dummy inventory module. You should deactivate it. But before you do, could you tell me how many Zentral workers you have, filter the inventory events for this "machine", and find the rate at which they are coming in?
Hi @np5
How do we deactivate the dummy inventory modules ("0123456789", "9876543210")? By archiving the existing enrolled machines?
We have 2 instances of the Zentral workers running. The rate is 2 events per minute.
Hi,
If you archive the machines, they will come back, because the dummy inventory client is going to simulate update events.
Edit base.json and remove the dummy client:
"zentral.contrib.inventory": {
    "prometheus_bearer_token": "CHANGE ME!!!",
    "clients": [
        {
            "backend": "zentral.contrib.inventory.clients.dummy"
        }
    ]
}
Then you can archive the machines.
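For illustration, the same section with the dummy client removed could look like this (a sketch: if you have real inventory clients configured, keep them in the "clients" list):

```json
"zentral.contrib.inventory": {
    "prometheus_bearer_token": "CHANGE ME!!!",
    "clients": []
}
```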
Hi @np5
Thanks.
Is there a way in Zentral to create multiple file paths in one go in probes? We have a requirement of 3000 file paths per probe, so manually creating them one by one is time consuming.
Also, is there no way to add exclude paths in probes?
I would advise you to generate a JSON probe feed based on a simple, editable input containing the 3000 paths. This JSON probe feed can then be imported into Zentral. You can even update the probe after you have updated the feed.
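A sketch of that approach: start from a probe feed exported from your own Zentral instance, then splice in the paths kept in a plain text file. The top-level "probes" object and the "file_paths" field names are assumptions here; check them against your exported feed.

```python
import json

def inject_paths(feed_template_path, paths_file, probe_key, out_path):
    """Hypothetical helper: load an exported probe feed, replace one probe's
    file-path list with the paths read from a text file (one per line), and
    write the result. "probes" and "file_paths" are assumed field names."""
    with open(feed_template_path) as f:
        feed = json.load(f)
    with open(paths_file) as f:
        paths = [line.strip() for line in f if line.strip()]
    feed["probes"][probe_key]["file_paths"] = paths  # assumed field name
    with open(out_path, "w") as f:
        json.dump(feed, f, indent=2)
    return len(paths)
```

The generated file can then be hosted somewhere reachable by Zentral and imported as a feed; regenerating it later lets you update the probe.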
Hi @np5
Do we have an example of a valid JSON probe feed with the expected structure, and under which option in Zentral should it be imported?
Start by building a probe with the Zentral GUI, then export it as a gist. You will get a valid Zentral feed with the probe inside. Modify this JSON structure to suit your needs, then host it somewhere so that Zentral can download it. You can then import it as a feed.
@np5 Thanks let me try it out.
We have 700+ machines enrolled so far, but it takes more than 15 minutes for the events posted by all these machines to be stored in Elasticsearch and to show up in Kibana searches.
Maybe the Elasticsearch cluster (4 nodes) is slow at processing events: the RabbitMQ store_elasticsearch queue grows to hundreds of thousands of messages within half an hour.
Postgres throws a max connections error. We are mostly receiving osquery log/distributed query/config requests, log being the most frequent. Do all these requests require a DB connection to be opened and closed?
FATAL: sorry, too many clients already
We can observe a lot of 'idle in transaction' connections at the DB for the query below, which keeps connections from being released by Zentral:
SELECT "inventory_machinesnapshot"."id", "inventory_machinesnapshot"."mt_hash", "inventory_machinesnapshot"."mt_created_at", "inventory_machinesnapshot"."source_id", "inventory_machinesnapshot"."reference", "inventory_machinesnapshot"."serial_number", "inventory_machinesnapshot"."business_unit_id", "inventory_machinesnapshot"."os_version_id", "inventory_machinesnapshot"."platform", "inventory_machinesnapshot"."system_info_id", "inventory_machinesnapshot"."type", "inventory_machinesnapshot"."teamviewer_id", "inventory_machinesnapshot"."puppet_node_id", "inventory_machinesnapshot"."public_ip_address", "inventory_businessunit"."id", "inventory_businessunit"."mt_hash", "inventory_businessunit"."mt_created_at", "inventory_businessunit"."source_id", "inventory_businessunit"."reference", "inventory_businessunit"."key", "inventory_businessunit"."name", "inventory_businessunit"."meta_business_unit_id", "inventory_metabusinessunit"."id", "inventory_metabusinessunit"."name", "inventory_metabusinessunit"."created_at", "...
Also, the RabbitMQ component closes/accepts connections many times, starting some period of time after it was started.
Do we need to create a RabbitMQ cluster for load balancing?
>> rabbitmqctl list_queues
Listing queues
process_events 2189780
amq.gen-lne4zEL1Fz2g2JG_jxitxA 0
amq.gen-MvMLHZ58xQt2MF_rNYP8xg 0
amq.gen-SUuy4o_9SO6ZkGThCdfbjA 0
amq.gen-JHca_UDGKIGTUI8LBpOGQQ 0
store_events_elasticsearch 3763244
=ERROR REPORT==== 7-Sep-2017::12:47:31 ===
closing AMQP connection <0.598.0> (10.0.0.23:56716 -> 10.0.0.9:5672):
{writer,send_failed,{error,timeout}}
=INFO REPORT==== 7-Sep-2017::12:48:58 ===
accepting AMQP connection <0.647.0> (10.0.0.23:59638 -> 10.0.0.9:5672)
=INFO REPORT==== 7-Sep-2017::12:48:58 ===
connection <0.647.0> (10.0.0.23:59638 -> 10.0.0.9:5672): user 'guest' authenticated and granted access to vhost '/'
Postgres: try CONN_MAX_AGE in the Django DB settings.
Queue: use the rabbitmq_exporter to export the RabbitMQ metrics for Prometheus. You will then see how fast you are producing events.
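The CONN_MAX_AGE tip can be sketched in plain Django settings terms (Zentral drives these from base.json; the names and values here are illustrative, not Zentral's actual configuration):

```python
# Django DATABASES sketch. CONN_MAX_AGE is the maximum lifetime, in seconds,
# of a persistent connection reused across requests: 0 closes the connection
# after each request (the default), None keeps connections alive indefinitely.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "zentral",      # illustrative
        "HOST": "db",           # illustrative (e.g. the docker service name)
        "CONN_MAX_AGE": 60,     # reuse a connection for up to 60 seconds
    }
}
```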
Hi @np5
I tried CONN_MAX_AGE with None (unlimited lifetime) and with a small integer value as well, but it hasn't helped much. Even then, we see the Postgres DB filled with 'idle in transaction' connections for the MachineSnapshot query. I explicitly timed out 'idle in transaction' sessions, but then idle transactions without any query or COMMIT statement pile up to max connections, which is weird, and the database goes down.
I will give rabbitmq_exporter a try, but as observed with "rabbitmqctl list_queues", the "store_events_elasticsearch" queue grows to hundreds of thousands of messages within 10 minutes.
Is there any solution to support heavy traffic from osquery, say 1000 clients hitting Zentral? As far as I can see, each enroll/config/log API call needs a database check for the latest configs, hardware serial and machine snapshot. Is this cached using @cached_property?
Also, the Postgres MachineSnapshotCommit table has 33000 records, with the same machines enrolled multiple times with different versions. Is that required?
I have archived the dummy clients, but now I'm receiving duplicate key value violates unique constraint "inventory_machinesnapshotcommit_serial_number_f9fee8fa_uniq" for existing enrolled machines.
How many total Zentral processes are you running? Gunicorn processes + all the workers with their child processes.
One of the standard simple deployments of Zentral runs everything (Postgres, Elasticsearch, Gunicorn, Nginx, Kibana, RabbitMQ) on one 8GB VPS. With this configuration, without having to tweak anything, ~30 events per second can be saved and > 1000 machines kept in the inventory, with multiple inventory sources per machine.
We are running 2 Zentral instances. We are not running gunicorn processes; we are using runserver instead. Do we need to change that to gunicorn for performance improvements?
Regarding the frequency of osquery events: every 10 minutes, 65000+ events are published to Elasticsearch, but the catch is that the Elasticsearch store queue at RabbitMQ grows in parallel, with hundreds of thousands of events waiting.
Note that you can run an extra store worker without adding all the other workers. See the runworkers command. With docker-entrypoint.py, you can invoke it like this:
$ docker-compose run --rm workers runworkers --list-workers
to get a list of all the configured workers run by default (1 process each).
To run an extra store worker without adding all the extra workers (thus saving on Postgres connections):
$ docker-compose run --rm workers runworkers 'store worker elasticsearch'
This is what happened after starting 2 extra processor workers and 2 extra store workers on the 8GB VPS (tested today with a lot of audit events): the processor worker scales pretty well; the store worker is probably blocked by the Elasticsearch server.
After all the events had been processed, I removed the extra processor workers and added one store worker. It scales pretty well too when the server is under less load.
I am not saying that the performance of the workers is great. There are many things that could be optimized, and the caching of the machine info would be one of them. But it can already scale by adding some specific workers.
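The machine info caching mentioned above could, for instance, rely on a cached property. This is a generic sketch, not Zentral's code (the class and field names are made up for illustration):

```python
from functools import cached_property  # Python >= 3.8; Django ships an
                                       # equivalent in django.utils.functional

class MachineInfo:
    """Illustrative only: memoize an expensive inventory lookup so that
    repeated accesses on the same object hit the backend only once."""
    load_count = 0  # counts how many times the expensive lookup ran

    def __init__(self, serial_number):
        self.serial_number = serial_number

    @cached_property
    def snapshot(self):
        # In real code, this would be the expensive MachineSnapshot query.
        MachineInfo.load_count += 1
        return {"serial_number": self.serial_number}
```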
About gunicorn:
runserver is not meant for production use:
DO NOT USE THIS SERVER IN A PRODUCTION SETTING. It has not gone through security audits or performance tests. (And that’s how it’s gonna stay. We’re in the business of making Web frameworks, not Web servers, so improving this server to be able to handle a production environment is outside the scope of Django.)
I don't think it is the core of your problem here. You seem to be able to get most of the events sent by the clients. But use a real web application server like gunicorn.
Another important thing:
Django DEBUG settings
Never deploy a site into production with DEBUG turned on.
Did you catch that? NEVER deploy a site into production with DEBUG turned on.
Among the things that DEBUG turns on: the recording of all postgres queries, and a debug page on all 40x/50x errors. Not something you want to have in prod.
You can change this in base.json.
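In plain Django settings terms, the points above boil down to the following (Zentral drives this from base.json; the hostname here is a made-up example):

```python
# settings.py sketch: DEBUG must be off in production. With DEBUG = True,
# Django records every SQL query and renders a detailed debug page on errors,
# neither of which belongs in prod.
DEBUG = False
ALLOWED_HOSTS = ["zentral.example.com"]  # required once DEBUG is off
```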
Thanks @np5 for the detailed description.
gunicorn solves the max connections issue. It's able to serve our load for now.
Yes, DEBUG was already false in production.
We are running 2 instances of runworkers in docker swarm. As you have suggested, we will revisit the optimum number of store workers/processors.
So the only problem left is the duplicate key violation? Could you try to find out the inventory source, whether it concerns a lot of machines or only a few, and how often it happens?
We have this issue with some machines running Santa. We update the inventory on Santa preflights, but sometimes the preflight calls happen more than 10 times in a row, triggering more than 10 concurrent inventory updates, and thus duplicate key violations (same snapshot). Thanks to the integrity checks, this is not a critical issue, just something we should avoid. Still investigating.
We are also getting frequent warnings like the ones below:
base_extras WARNING App zentral.contrib.jamf w/o main menu config
base_extras WARNING App zentral.contrib.simplemdm w/o main menu config
base_extras WARNING App zentral.contrib.munki w/o main menu config
base_extras WARNING App zentral.contrib.nagios w/o main menu config
base_extras WARNING App zentral.contrib.osquery w/o main menu config
base_extras WARNING App zentral.contrib.santa w/o main menu config
base_extras WARNING App zentral.contrib.inventory w/o setup menu config
base_extras WARNING App zentral.core.probes w/o setup menu config
Maybe it should not be logged at the WARNING level, but at the INFO level (done in e33eac4).
The template tag used to build the main menu looks for a main_menu_config in the urls.py files of the active Zentral modules. Some of them have one, and a menu is displayed (like monolith). Some of them don't.
This issue is epic…
We are also facing a timeout issue with the gunicorn workers, maybe due to Elasticsearch index recovery running in parallel; I have increased the worker timeout to 90 seconds for now. The Elasticsearch index recovery history shows that one recovery took 36 seconds, and by default gunicorn workers time out after 30 seconds. Is Elasticsearch the root cause of the worker timeouts?
[CRITICAL] WORKER TIMEOUT (pid:184)
[CRITICAL] WORKER TIMEOUT (pid:168)
[CRITICAL] WORKER TIMEOUT (pid:169)
[CRITICAL] WORKER TIMEOUT (pid:170)
[CRITICAL] WORKER TIMEOUT (pid:181)
[CRITICAL] WORKER TIMEOUT (pid:182)
[INFO] Booting worker with pid: 193
[INFO] Booting worker with pid: 194
[INFO] Booting worker with pid: 195
[INFO] Booting worker with pid: 181
[INFO] Booting worker with pid: 182
[INFO] Booting worker with pid: 183
elasticsearch WARNING Elasticsearch index recovering
elasticsearch WARNING Elasticsearch index recovering
elasticsearch WARNING Elasticsearch index recovering
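For reference, the worker timeout bump described above can be kept in a gunicorn configuration file (a sketch; the values are illustrative, not a recommendation):

```python
# Hypothetical gunicorn.conf.py sketch. The default worker timeout is 30 s;
# raising it stops workers being killed while they wait on slow Elasticsearch
# calls, but the slow dependency remains the root cause.
workers = 4    # illustrative worker count
timeout = 90   # seconds of silence before a worker is killed and rebooted
```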
We also see warnings like the following multiple times:
base WARNING Not Found: /..\pixfir~1\how_to_login.html
base WARNING Not Found: /web-console/ServerInfo.jsp
base WARNING Not Found: /..\pixfir~1\how_to_login.html
base WARNING Not Found: /invoker/EJBInvokerServlet
base WARNING Not Found: /invoker/JMXInvokerServlet
base WARNING Not Found: /note.txt
base WARNING Not Found: /axis-cgi/operator/param.cgi
base WARNING Not Found: /sapmc/sapmc.html
@np5 Do you mean an intrusion?
Well, this is obviously a web crawler trying to find out what is running on your Zentral server.